list<item: float> not supported in join non-key field? - pyarrow

I am trying to join two Arrow tables in which some columns have the list<float> data type. My join keys are all primitive types; only some of the non-key columns are list<float>. Still, PyArrow's join() cannot join such tables, although pandas can. It says
ArrowInvalid: Data type list<item: float> is not supported in join non-key field
when I execute this piece of code
joined_table = table_1.join(table_2, ['k1', 'k2', 'k3'])
Any idea on how to fix this issue or get around this would be helpful. Thanks.

I think currently PyArrow join doesn't support some column types. See criteria for allowed types here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/hash_join_node.cc#L48
I believe the issue is (but I might be wrong) that list is not a fixed-width type and cannot be processed currently. You might want to open a Jira about this.

Related

Creating External Table with Redshift Spectrum from nested JSON

I’m creating an external table from json data with input format org.apache.hadoop.mapred.TextInputFormat and output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the json is a highly nested json called groups. The nested data doesn't follow a strict schema, so not all json within groups have the same attributes. I'm having trouble accessing group's attributes and I suspect that I am not casting groups to the proper datatype.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried the super type: when I select groups I get the entire json, but when I select an attribute of groups, such as select groups.sellersAuths from ... or select groups."sellersAuths" from ..., I get relation groups does not exist.
I've also tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>; however, when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table, or whether my use case is what the super type is for.
I've also tried using JSON_PARSE after I cast the data to VARCHAR, or super or struct but that presents issues as well.
Thanks a ton for reading!

How to query an array field (AWS Glue)?

I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
Content exists, because I am able to select other fields from the main key in the json (content).
It is probably something related to how to query an array field in Spectrum.
Any idea?
You have to give the table an alias to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data. I really don't know why it needs this alias, but it works. If you need to access the elements of an array, you have to explode it like this:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members as member;
Reference
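To make the array syntax less mysterious, here is a rough Python picture of what the FROM schema.tableA a, a.content.members as member clause does: it yields one output row per array element. The sample rows and the name field are invented for illustration, and the sketch assumes the comma join behaves like an inner join, so rows with an empty array produce no output.

```python
# Invented sample rows shaped like the table in the question.
rows = [
    {"content": {"my_string": "team-a",
                 "members": [{"name": "Ann"}, {"name": "Bo"}]}},
    {"content": {"my_string": "team-b", "members": []}},
]


def unnest_members(rows):
    """Yield one (my_string, member name) pair per array element."""
    for a in rows:                               # FROM schema.tableA a
        for member in a["content"]["members"]:   # , a.content.members as member
            yield a["content"]["my_string"], member["name"]


flat = list(unnest_members(rows))
print(flat)  # [('team-a', 'Ann'), ('team-a', 'Bo')]
```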
You need to create a Glue Classifier.
Select JSON as Classifier type
and for the JSON Path input the following:
$[*]
then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for but figured I'd drop this here just in case others had the same problem I had.
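As a rough illustration of what the $[*] path selects (this is not Glue's implementation, just the same idea in plain Python): when a file is one top-level JSON array, each element of that array is treated as its own record. The sample document and the function name are made up.

```python
import json


def records_from_top_level_array(text):
    """Return the elements that the JSON path $[*] would select."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return data


# Invented sample file content: a single array holding two records.
raw = '[{"id": 1, "members": ["a", "b"]}, {"id": 2, "members": ["c"]}]'
records = records_from_top_level_array(raw)
for record in records:
    print(sorted(record))  # the field names found in each record
```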

Join returns NULL when data that matches is in the table

I'm trying to get results when both tables have the same machine number and there are entries that have the same number in both tables.
Here is what I've tried:
SELECT fehler.*,
'maschine.Maschinen-Typ',
maschine.Auftragsnummer,
maschine.Kunde,
maschine.Liefertermin_Soll
FROM fehler
JOIN maschine
ON ltrim(rtrim('maschine.Maschinen-Nr')) = ltrim(rtrim(fehler.Maschinen_Nr))
The field I'm joining on is a varchar in both tables. I tried without the trims, but it still returns an empty result.
I'm using MariaDB (if that's important).
ON ltrim(rtrim('maschine.Maschinen-Nr')) = ltrim(rtrim(fehler.Maschinen_Nr)) seems wrong...
Is fehler.Maschinen_Nr really the string 'maschine.Maschinen-Nr'?
SELECT fehler.*, maschine.`Maschinen-Typ`, maschine.Auftragsnummer, maschine.Kunde, maschine.Liefertermin_Soll
FROM fehler
JOIN maschine
ON ltrim(rtrim(maschine.`Maschinen-Nr`)) = ltrim(rtrim(fehler.Maschinen_Nr))
Your last line compared the string literal 'maschine.Maschinen-Nr' to the column, so nothing matched. This should do it.
Also, use backticks to quote column names, and quote only the column name, not the table prefix.
The single quotes are string delimiters. You are comparing fehler.Maschinen_Nr with the string 'maschine.Maschinen-Nr'. In standard SQL you would use double quotes for names (and I think MariaDB allows this, too, certain settings provided). In MariaDB the commonly used identifier quote is the backtick, placed around the column name only:
SELECT fehler.*,
maschine.`Maschinen-Typ`,
maschine.Auftragsnummer,
maschine.Kunde,
maschine.Liefertermin_Soll
FROM fehler
JOIN maschine
ON trim(maschine.`Maschinen-Nr`) = trim(fehler.Maschinen_Nr)
(It would be better of course not to use names with a minus sign or other characters that force you to use name delimiters in the first place.)
As you see, you can use TRIM instead of LTRIM and RTRIM. It would be better, though, not to allow space at the beginning or end when inserting data. Then you wouldn't have to remove them in every query.
Moreover, it seems Maschinen_Nr should be the primary key of the table maschine and, naturally, a foreign key in the table fehler. That would make sure fehler doesn't contain any Maschinen_Nr that does not exist in exactly that form in maschine.
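The effect of the single quotes is easy to reproduce in a small script. This sketch uses Python's built-in sqlite3 module as a stand-in for MariaDB (an assumption made only so the example is self-contained), with invented sample rows and a made-up Beschreibung column. It uses standard double quotes for the hyphenated column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE maschine ("Maschinen-Nr" TEXT, Kunde TEXT);
    CREATE TABLE fehler (Maschinen_Nr TEXT, Beschreibung TEXT);
    INSERT INTO maschine VALUES ('M-100 ', 'ACME');
    INSERT INTO fehler VALUES (' M-100', 'sensor fault');
''')

# Single quotes: the left side is the literal text 'maschine.Maschinen-Nr',
# which never equals a machine number, so the join matches nothing.
wrong = conn.execute('''
    SELECT fehler.Beschreibung, maschine.Kunde
    FROM fehler JOIN maschine
      ON trim('maschine.Maschinen-Nr') = trim(fehler.Maschinen_Nr)
''').fetchall()

# Identifier quotes around the column name only: the columns are compared.
right = conn.execute('''
    SELECT fehler.Beschreibung, maschine.Kunde
    FROM fehler JOIN maschine
      ON trim(maschine."Maschinen-Nr") = trim(fehler.Maschinen_Nr)
''').fetchall()

print(wrong)  # []
print(right)  # [('sensor fault', 'ACME')]
```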
To avoid these problems in the future, follow the usual database convention of snake case (lowercase_lowercase) for names.
Besides that, posting your DB schema would be really helpful, since I can't guess your data structures.
(For friendlier development, it is useful to write variables, tables and columns in English.)
So, what is the error that you get? If table "maschine" has a column named "Maschinen-Nr", table "fehler" has a column named "Maschinen_Nr", and the values match each other, the join should be correct.
Be careful with Maschinen-Nr and Maschinen_Nr. Do they have - and _ on purpose?
A very blind solution, because you don't really say what your problem or even your schema is:
SELECT table1Alias.*, table2Alias.column_name, table2Alias.column_name
FROM table1 [table1Alias]
JOIN table2 [table2Alias]
ON ltrim(rtrim(table1Alias.matching_column)) = ltrim(rtrim(table2Alias.matching_column))
where the matching columns are the PK and FK respectively, or otherwise hold matching data in both tables. The [] parts are optional; if no alias is given, the table name is used.

How to write a code to convert text into a number?

The database I'm working on stores a field as text in one table, whereas the other table stores it as a number. I cannot change the field format in the database at all. Therefore I need to know how to convert the field from text to number before joining the tables to pull the data.
SELECT DISTINCT tblCoachingDB.ID, tblCoachingDB.SourceId, tblCoachingDBSource.ID
FROM tblCoachingDB, tblCoachingDBSource
WHERE (((tblCoachingDB.SourceId)="12"));
The tblCoachingDB.SourceID is a TEXT whereas the tblCoachingDBSource.ID is a NUMBER
You can use CStr() to cast a number as text and JOIN that to another text field.
SELECT DISTINCT
tblCoachingDB.ID,
tblCoachingDB.SourceId,
tblCoachingDBSource.ID
FROM
tblCoachingDB INNER JOIN tblCoachingDBSource
ON tblCoachingDB.SourceId = CStr(tblCoachingDBSource.ID)
WHERE tblCoachingDB.SourceId='12';
Actually I would leave out the WHERE clause until after you confirm the JOIN works properly.
You originally asked to JOIN by converting the text field to number. I first suggested text instead because I recall Access was less likely to object. But my memory about that is shaky, and if you want numeric for both sides of the JOIN, see which of these (if any) works best for you:
ON Int(tblCoachingDB.SourceId) = tblCoachingDBSource.ID
ON CLng(tblCoachingDB.SourceId) = tblCoachingDBSource.ID
ON Val(tblCoachingDB.SourceId) = tblCoachingDBSource.ID
Note I offered this suggestion only because you told us you are not permitted to alter your tblCoachingDB table's design to make SourceId numeric instead of text datatype. Since you can't make that change, you will have to live with the run-time performance impact of converting the datatype of a JOIN field. That is not a good thing, but I don't know how bad it will be. Good luck.
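Both directions of the conversion can be tried out on a small scale. The sketch below is not Access: it uses Python's built-in sqlite3 module, where CAST plays the role that CStr() and CLng() play above. The sample rows and the SourceName column are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblCoachingDB (ID INTEGER, SourceId TEXT);
    CREATE TABLE tblCoachingDBSource (ID INTEGER, SourceName TEXT);
    INSERT INTO tblCoachingDB VALUES (1, '12'), (2, '7');
    INSERT INTO tblCoachingDBSource VALUES (12, 'Phone'), (7, 'Email');
""")

# Convert the number side to text (the CStr() approach).
as_text = conn.execute("""
    SELECT c.ID, c.SourceId, s.ID
    FROM tblCoachingDB AS c
    INNER JOIN tblCoachingDBSource AS s
      ON c.SourceId = CAST(s.ID AS TEXT)
    ORDER BY c.ID
""").fetchall()

# Convert the text side to a number (the CLng() approach).
as_number = conn.execute("""
    SELECT c.ID, c.SourceId, s.ID
    FROM tblCoachingDB AS c
    INNER JOIN tblCoachingDBSource AS s
      ON CAST(c.SourceId AS INTEGER) = s.ID
    ORDER BY c.ID
""").fetchall()

print(as_text)    # [(1, '12', 12), (2, '7', 7)]
print(as_number)  # [(1, '12', 12), (2, '7', 7)]
```

The two versions agree here because every SourceId is a clean integer string. Values with leading zeros or stray spaces would match under the numeric comparison but not under the text one.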
Assuming that all values in tblCoachingDB.SourceID are numbers, you could create a query that selects all fields from tblCoachingDB EXCEPT SourceID, and then add a new field to the query: SourceID: CLng(tblCoachingDB.SourceID)
You would then use the query instead of tblCoachingDB anywhere you need to make the join. A second alternative would be to create a query for tblCoachingDBSource with the field SourceID: CStr(tblCoachingDBSource.ID). A third alternative would be:
SELECT * FROM tblCoachingDB, tblCoachingDBSource
WHERE (clng(tblCoachingDB.SourceId)=tblCoachingDBSource.ID
AND ((tblCoachingDB.SourceId)="12"));

Left join table ON row with JSON values?

This one is tough. I have two tables that I need to join on a specific column, and the issue is that the column in the first table holds a JSON value.
This is the JSON value from the items table:
[{"id":"15","value":"News Title"},{"id":"47","value":"image1.jpg"},{"id":"33","value":"$30"}]
This is the attributes table, which I need to join on the JSON id to get the actual attribute name, like Title, Image, Price:
id Name
15 Title
47 Image
33 Price
So the start is:
SELECT item_values FROM ujm_items
LEFT JOIN?????
WHERE category = 8 AND published = 1 ORDER BY created DESC
But how to left join on JSON, I have no clue.
Any help is appreciated.
... and this is why you don't store structured data in a single SQL field. It negates the whole purpose of a relational database.
Unless you've got a DB that includes a JSON parser, you've got two options:
a) unreliable string operations to find/extract a particular key/value pair
b) slurp the json into a client which CAN parse back to native, extract the key/values you want, then use some other ID field for the actual joins.
SELECT ...
LEFT JOIN ON SUBSTR(jsonfield, POSITION('"id:"', jsonfield)) etc...
Either way, it utterly torpedoes performance since you can't use indexes on these calculated/derived values.
Note that this won't work as is. It's just to demonstrate how utterly ugly this gets.
Fix your tables: normalize the design and don't store JSON data that you need to extract data from. It's one thing to put in a JSON string that you'll only ever fetch/update in its entirety. It's a completely different thing to have one whose sub-values you need to join on.
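If the schema cannot be changed right away, option (b) above can be sketched in a few lines of client code. This is a minimal Python example using the sample row and attribute table from the question. It assumes the attributes have already been fetched into a dict, and the function name is made up.

```python
import json

# Attribute lookup, as it would be fetched from the attributes table.
attributes = {15: "Title", 47: "Image", 33: "Price"}


def label_item_values(item_values_json, attributes):
    """Map each JSON id to its attribute name; unknown ids keep the raw id."""
    pairs = json.loads(item_values_json)
    return {attributes.get(int(p["id"]), p["id"]): p["value"] for p in pairs}


row = ('[{"id":"15","value":"News Title"},'
       '{"id":"47","value":"image1.jpg"},'
       '{"id":"33","value":"$30"}]')
labeled = label_item_values(row, attributes)
print(labeled)
# {'Title': 'News Title', 'Image': 'image1.jpg', 'Price': '$30'}
```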