SQL to equivalent pandas - Merge on columns where column is null - mysql

I opened this new question because I'm not sure the request and the wording in the original matched each other: pandas left join where right is null on multiple columns
What is the equivalent pandas code for this SQL? Contextually, we're finding entries from a column in table_y that aren't in table_x with respect to several columns.
SELECT
table_x.column,
table_x.column2,
table_x.column3,
table_y.column,
table_y.column2,
table_y.column3
FROM table_x
LEFT JOIN table_y
ON table_x.column = table_y.column
AND table_x.column2 = table_y.column2
WHERE
table_y.column2 is NULL
Is this it?
columns_join = ['column', 'column2']
data_y = data_y.set_index(columns_join)
data_x = data_x.set_index(columns_join)
data_diff = pandas.concat([data_x, data_y]).drop_duplicates(keep=False)  # any row not in both
# Select the diff representative from each dataset - in case the datasets are too large
x1 = data_x[data_x.index.isin(data_diff.index)]
x2 = data_y[data_y.index.isin(data_diff.index)]
# Perform an outer join with the joined indices from each set,
# then keep only the entries contributed by table_x
data_compare = x1.merge(x2, how='outer', indicator=True, left_index=True, right_index=True)
data_compare_final = (
    data_compare
    .query('_merge == "left_only"')
    .drop('_merge', axis=1)
)
I don't think that's equivalent because we only removed entries from table_x that aren't in the join based on multiple columns. I think we have to continue and compare the column against table_y.
data_compare = data_compare.reset_index().set_index('column2')
data_y = data_y.reset_index().set_index('column2')
mask_column2 = data_y.index.isin(data_compare.index)
result = data_y[~mask_column2]
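For reference, the usual one-step pandas idiom for this kind of anti-join is a left merge with indicator=True; here is a sketch using the column names from the SQL above and made-up values:
import pandas as pd

# Illustrative data only; the real table_x / table_y come from the question
table_x = pd.DataFrame({'column': [1, 2], 'column2': ['a', 'b'], 'column3': [10, 20]})
table_y = pd.DataFrame({'column': [1], 'column2': ['a'], 'column3': [99]})

columns_join = ['column', 'column2']
anti = (
    table_x.merge(table_y[columns_join].drop_duplicates(),
                  on=columns_join, how='left', indicator=True)
    .query('_merge == "left_only"')  # rows of table_x with no match in table_y
    .drop(columns='_merge')
)
This keeps the rows of table_x that have no matching (column, column2) pair in table_y, which is what the LEFT JOIN ... WHERE table_y.column2 IS NULL returns.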

Without test data it is a bit difficult to be sure this helps, but you can try:
# Only needed if the join columns in the right dataframe have the same names as in the left;
# otherwise this step can be skipped
table_y[['col_join_1', 'col_join_2']] = table_y[['column', 'column2']]
# Merge left (LEFT JOIN)
table_merged = table_x.merge(
    table_y,
    how='left',
    left_on=['column', 'column2'],
    right_on=['col_join_1', 'col_join_2'],
    suffixes=['_x', '_y']
)
# Filter the dataframe (the WHERE table_y.column2 IS NULL part)
table_merged = table_merged.loc[
    table_merged.column2_y.isna(),
    ['column_x', 'column2_x', 'column3_x', 'column_y', 'column2_y', 'column3_y']
]
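As a quick sanity check of the snippet above with made-up values (only the (2, 'b') row of table_x has no match, so it is the only row kept):
import pandas as pd

table_x = pd.DataFrame({'column': [1, 2], 'column2': ['a', 'b'], 'column3': [10, 20]})
table_y = pd.DataFrame({'column': [1], 'column2': ['a'], 'column3': [99]})

table_y[['col_join_1', 'col_join_2']] = table_y[['column', 'column2']]
merged = table_x.merge(table_y, how='left',
                       left_on=['column', 'column2'],
                       right_on=['col_join_1', 'col_join_2'],
                       suffixes=['_x', '_y'])
# Only the (2, 'b') row of table_x has no match, so it is the only row
# where column2_y is NaN after the left merge
print(merged.loc[merged.column2_y.isna()])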

I found an equivalent that amounts to setting the index to the join column(s), union'ing the tables, dropping the duplicates, and performing an outer join between the contributions to the union. From there, one can select
left_only for this equivalent SQL
SELECT
table_x.*,
table_y.*
FROM table_x
LEFT JOIN table_y
ON table_x.column = table_y.column
AND table_x.column2 = table_y.column2
WHERE
table_y.column2 is NULL
right_only for this equivalent SQL
SELECT
table_x.*,
table_y.*
FROM table_y
LEFT JOIN table_x
ON table_y.column = table_x.column
AND table_y.column2 = table_x.column2
WHERE
table_x.column2 is NULL
import pandas

def create_dataframe_joined_diffs(dataframe_prod, dataframe_new, columns_join):
    """
    Set the indices to columns_join,
    concat the dataframes and remove duplicates,
    select the diff representative from each dataset,
    then reset the indices and perform an outer join.
    Pseudo-SQL:
    SELECT
    UNIQUE(*)
    FROM dataframe_prod
    OUTER JOIN dataframe_new
    ON columns_join
    """
    data_new = dataframe_new.set_index(columns_join)
    data_prod = dataframe_prod.set_index(columns_join)
    # Get any row not in both (may be removing too many)
    data_diff = pandas.concat([data_prod, data_new]).drop_duplicates(keep=False)
    # Select the diff representative from each dataset
    x1 = data_prod[data_prod.index.isin(data_diff.index)]
    x2 = data_new[data_new.index.isin(data_diff.index)]
    # Perform an outer join and keep the joined indices from each set
    # Sort the columns to make them easier to compare
    data_compare = x1.merge(x2, how='outer', indicator=True,
                            left_index=True, right_index=True).sort_index(axis=1)
    return data_compare

# dataframe_compare below is the result of create_dataframe_joined_diffs(...)
mask_left = dataframe_compare['_merge'] == 'left_only'
mask_right = dataframe_compare['_merge'] == 'right_only'
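A usage sketch with made-up data (the values are illustrative only; the parameter names follow the function above):
import pandas

dataframe_prod = pandas.DataFrame({'column': [1, 2], 'column2': ['a', 'b'], 'column3': [10, 20]})
dataframe_new = pandas.DataFrame({'column': [2, 3], 'column2': ['b', 'c'], 'column3': [20, 30]})

dataframe_compare = create_dataframe_joined_diffs(dataframe_prod, dataframe_new,
                                                  ['column', 'column2'])
print(dataframe_compare[dataframe_compare['_merge'] == 'left_only'])   # (1, 'a'): only in prod
print(dataframe_compare[dataframe_compare['_merge'] == 'right_only'])  # (3, 'c'): only in new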

Related

join fields with numbers

Some books have more than one author. I have a table with book_id and author_id1, author_id2, author_id3, and author_id4, and another table with author_id and author_name.
How can I join these two tables and the main table with book_id to get the author names together in one row from a SQL join query?
Example:
SELECT book.book_id, book.title, author.author, book.location
FROM books AS b JOIN book_authors AS ba ON b.book_id = ba.book_id JOIN authors AS a ON REGEX ba.authors_id$ = a.authors_id
Not sure about the REGEX ($) use in SQL. It should display id, title, authors, location.
How do I get all authors_id# to match authors_id (notice one has a number at the end and the other does not)?
Update: I would like to get book_authors.authors_id1 to match authors.authors_id, book_authors.authors_id2 to match authors.authors_id, book_authors.authors_id3 to match authors.authors_id, and book_authors.authors_id4 to match authors.authors_id, and return all the matching authors in a list.
...
# merge book_authors and authors into one dataframe
ba_df.rename(columns={'authors_id1': 'authors_id'}, inplace=True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns={'authors_id': 'authors_name1', 'authors_id2': 'authors_id'}, inplace=True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns={'authors_id': 'authors_name2', 'authors_id3': 'authors_id'}, inplace=True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns={'authors_id': 'authors_name3', 'authors_id4': 'authors_id'}, inplace=True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns={'authors_id': 'authors_name4'}, inplace=True)
...
I was working through another dataframe and got the idea to use map after a rename, with set_index the same on both dataframes. The map lines then work; you just have to rename the common column so it isn't overwritten. In this case that column was authors_id, which I replaced with authors_name1, 2, 3 & 4, corresponding to authors_id1, 2, 3 & 4. And yes, it is not pure SQL, but it works in Python, which is where I had the problem.
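The same idea can be written as a short loop, in case that is easier to follow (a sketch assuming ba_df has authors_id1..authors_id4 and a_df has authors_id/authors_name, as above):
id_to_name = a_df.set_index('authors_id')['authors_name']
for i in range(1, 5):
    # look up each authors_idN in the shared mapping and store it as authors_nameN
    ba_df[f'authors_name{i}'] = ba_df[f'authors_id{i}'].map(id_to_name)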

select by columns multiple criteria

I have this db structure
and these are my joined tables (sample data)
I want to filter for cars where (key = 'price' and value > 4000) and (key = 'top-speed' and value > 200).
Thanks for the help.
Try this:
SELECT
c.id,
c.name,
s.key,
csv.value
FROM
car c
LEFT JOIN car_specification_value csv ON c.id = csv.car_id
LEFT JOIN specification s ON csv.specification_id = s.id
WHERE
(s.key = 'price' AND csv.value > 4000)
OR (s.key = 'top-speed' AND csv.value > 200);
A common practice is to give your tables aliases so the join conditions can be written more concisely.
One method uses aggregation:
select csv.car_id
from car_specification_value csv
where (csv.key = 'price' and (csv.value + 0) > 4000) or
(csv.key = 'top-speed' and (csv.value + 0) > 200)
group by csv.car_id
having count(distinct csv.key) = 2; -- both match
Note the + 0. This uses implicit conversion to change the value to a number so it can be properly compared to a number. One challenge of key/value data structures is that all the values are stored as strings, which is tricky for other data types.
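For comparison, the same both-criteria check can be expressed in pandas; this is a sketch with a made-up specs dataframe (columns car_id, key, value, mirroring the joined tables):
import pandas as pd

specs = pd.DataFrame({
    'car_id': [1, 1, 2, 2],
    'key': ['price', 'top-speed', 'price', 'top-speed'],
    'value': ['5000', '250', '3000', '250'],
})
values = pd.to_numeric(specs['value'])  # plays the role of the "+ 0" conversion
match = (((specs['key'] == 'price') & (values > 4000)) |
         ((specs['key'] == 'top-speed') & (values > 200)))
# keep car_ids where both keys matched (the HAVING COUNT(DISTINCT key) = 2 part)
hits = specs[match].groupby('car_id')['key'].nunique()
car_ids = hits[hits == 2].index.tolist()  # -> [1]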

MySQL query getting a string value from query result

I have a query as follows:
SELECT *
FROM tb_circulares LEFT JOIN tb_colegios ON tb_circulares.colegio_circular = tb_colegios.id_colegio
LEFT JOIN tb_circulares_clase ON tb_circulares.codigo_circular = tb_circulares_clase.circular
LEFT JOIN tb_clases ON tb_circulares_clase.clase = tb_clases.id_clase
WHERE colegio_circular = 17
The query output shows three rows.
One of the columns in each row is the value of the field tb_circulares_clase.nombre_clase.
I would like to get a single string containing the three resulting values of tb_circulares_clase.nombre_clase.
For example:
row1-> nombre_clase = "1º primaria"
row2-> nombre_clase = "3ª secundaria"
row3-> nombre_clase = "4º primaria"
Is it possible to get a resulting query field with the final value?
resultado = "1º primaria - 3ª secundaria - 4º primaria"
Thanks
Sounds like you're after group_concat(), an aggregate function that concatenates all the values of a group.
SELECT group_concat(nombre_clase SEPARATOR ' - ') AS resultado
FROM tb_circulares
LEFT JOIN tb_colegios
ON tb_circulares.colegio_circular = tb_colegios.id_colegio
LEFT JOIN tb_circulares_clase
ON tb_circulares.codigo_circular = tb_circulares_clase.circular
LEFT JOIN tb_clases
ON tb_circulares_clase.clase = tb_clases.id_clase
WHERE colegio_circular = 17;
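For a pandas equivalent of GROUP_CONCAT over such a result set, joining the values with ' - ' does the same thing (a sketch; the dataframe below just reproduces the three example rows):
import pandas as pd

df = pd.DataFrame({'nombre_clase': ['1º primaria', '3ª secundaria', '4º primaria']})
resultado = ' - '.join(df['nombre_clase'])
# resultado == '1º primaria - 3ª secundaria - 4º primaria'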

Excluding a value from the return result of MySQL

I'm facing a problem and I'm not finding the answer. I'm querying a MySQL table from my Java process and I would like to exclude some rows from the result of my query.
Here is the query:
SELECT
o.offer_id,
o.external_cat,
o.cat,
o.shop,
o.item_id,
oa.value
FROM
offer AS o,
offerattributes AS oa
WHERE
o.offer_id = oa.offer_id
AND (cat = 1200000 OR cat = 12050200
OR cat = 13020304
OR cat = 3041400
OR cat = 3041402)
AND (oa.attribute_id = 'status_live_unattached_pregen'
OR oa.attribute_id = 'status_live_attached_pregen'
OR oa.attribute_id = 'status_dead_offer_getter'
OR oa.attribute_id = 'most_recent_status')
AND (oa.value = 'OK'
OR oa.value='status_live_unattached_pregen'
OR oa.value='status_live_attached_pregen'
OR oa.value='status_dead_offer_getter')
The trick here is that I need the value to be 'OK' in order to continue my process, but I don't need MySQL to return it in its response; I only need the other values to be returned. At the moment it returns two rows per query: one with the 'OK' value and another with one of the other values.
I would like the return value to be like this:
'000005261383370', '10020578', '1200000', '562', '1000000_157795705', 'status_live_attached_pregen'
for my query, but it returns:
'000005261383370', '10020578', '1200000', '562', '1000000_157795705', 'OK'
'000005261383370', '10020578', '1200000', '562', '1000000_157795705', 'status_live_attached_pregen'
Some help would really be appreciated.
Thank you!
You can solve this with an INNER JOIN on the table itself (a self-join), I think:
SELECT o.offer_id
,o.external_cat
,o.cat
,o.shop
,o.item_id
,oa.value
FROM offer AS o
INNER JOIN offerattributes AS oa
ON o.offer_id = oa.offer_id
INNER JOIN offerattributes AS oaOK
ON oaOK.offer_id = oa.offer_id
AND oaOK.value = 'OK'
WHERE o.cat IN (1200000,12050200,13020304,3041400,3041402)
AND oa.attribute_id IN ('status_live_unattached_pregen','status_live_attached_pregen','status_dead_offer_getter','most_recent_status')
AND oa.value IN ('status_live_unattached_pregen','status_live_attached_pregen','status_dead_offer_getter');
By doing a self-JOIN with the restriction of value OK, it will limit the result set to offer_ids that have an OK response, but the WHERE clause will still retrieve the values you need. Based on your description, I think this is what you were looking for.
I also converted your implicit cross join to an explicit INNER JOIN, and changed your ORs to INs; it should be more performant this way.
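The same restriction can be sketched in pandas, for comparison: first find the offer_ids that have an 'OK' row, then return only the other status rows for those offers (the dataframe name and values below are illustrative only):
import pandas as pd

oa = pd.DataFrame({
    'offer_id': ['A', 'A', 'B'],
    'attribute_id': ['most_recent_status', 'status_live_attached_pregen', 'most_recent_status'],
    'value': ['OK', 'status_live_attached_pregen', 'KO'],
})
ok_offers = oa.loc[oa['value'] == 'OK', 'offer_id'].unique()  # offers that have an OK row
statuses = ['status_live_unattached_pregen', 'status_live_attached_pregen',
            'status_dead_offer_getter']
result = oa[oa['offer_id'].isin(ok_offers) & oa['value'].isin(statuses)]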

Eager loading hierarchical children with explicit self-joins and contains_eager in SQLAlchemy

Given the following relationships:
- 1 MasterProduct parent -> many MasterProduct children
- 1 MasterProduct child -> many StoreProducts
- 1 StoreProduct -> 1 Store
I have defined the following declarative models in SQLAlchemy:
from sqlalchemy import Column, DateTime, ForeignKey, Integer
from sqlalchemy.orm import backref, relationship

class MasterProduct(Base):
    __tablename__ = 'master_products'
    id = Column(Integer, primary_key=True)
    pid = Column(Integer, ForeignKey('master_products.id'))
    children = relationship('MasterProduct', join_depth=1,
                            backref=backref('parent', remote_side=[id]))
    store_products = relationship('StoreProduct', backref='master_product')

class StoreProduct(Base):
    __tablename__ = 'store_products'
    id = Column(Integer, primary_key=True)
    mid = Column(Integer, ForeignKey('master_products.id'))
    sid = Column(Integer, ForeignKey('stores.id'))
    timestamp = Column(DateTime)
    store = relationship('Store', uselist=False)

class Store(Base):
    __tablename__ = 'stores'
    id = Column(Integer, primary_key=True)
My goal is to replicate the following query in SQLAlchemy with eager loading:
SELECT *
FROM master_products mp_parent
INNER JOIN master_products mp_child ON mp_child.pid = mp_parent.id
INNER JOIN store_products sp1 ON sp1.mid = mp_child.id
LEFT JOIN store_products sp2
ON sp1.mid = sp2.mid AND sp1.sid = sp2.sid AND sp1.timestamp < sp2.timestamp
WHERE mp_parent.id = 6752 AND sp2.id IS NULL
The query selects all MasterProduct children for parent 6752 and all
corresponding store products grouped by most recent timestamp using a NULL
self-join (greatest-n-per-group). There are 82 store products returned from the
query, with 14 master product children.
I've tried the following to no avail:
from sqlalchemy import and_
from sqlalchemy.orm import aliased, contains_eager

mp_child = aliased(MasterProduct)
sp1 = aliased(StoreProduct)
sp2 = aliased(StoreProduct)
q = db.session.query(MasterProduct).filter_by(id=6752) \
    .join(mp_child, MasterProduct.children) \
    .join(sp1, mp_child.store_products) \
    .outerjoin(sp2, and_(sp1.mid == sp2.mid, sp1.sid == sp2.sid, sp1.timestamp < sp2.timestamp)) \
    .filter(sp2.id == None) \
    .options(contains_eager(MasterProduct.children, alias=mp_child),
             contains_eager(MasterProduct.children, mp_child.store_products, alias=sp1))
>>> mp_parent = q.first() # the query below looks ok!
SELECT <all columns from master_products, master_products_1, and store_products_1>
FROM master_products INNER JOIN master_products AS master_products_1 ON master_products.id = master_products_1.pid INNER JOIN store_products AS store_products_1 ON master_products_1.id = store_products_1.mid LEFT OUTER JOIN store_products AS store_products_2 ON store_products_1.mid = store_products_2.mid AND store_products_1.sid = store_products_2.sid AND store_products_1.timestamp < store_products_2.timestamp
WHERE master_products.id = %s AND store_products_2.id IS NULL
LIMIT %s
>>> mp_parent.children # only *one* child is eagerly loaded (expected 14)
[<app.models.MasterProduct object at 0x2463850>]
>>> mp_parent.children[0].id # this is correct, 6762 is one of the children
6762L
>>> mp_parent.children[0].pid # this is correct
6752L
>>> mp_parent.children[0].store_products # only *one* store product is eagerly loaded (expected 7 for this child)
[<app.models.StoreProduct object at 0x24543d0>]
Taking a step back and simplifying the query to eagerly load just the children
also results in only 1 child being eagerly loaded instead of all 14:
mp_child = aliased(MasterProduct)
q = db.session.query(MasterProduct).filter_by(id=6752) \
    .join(mp_child, MasterProduct.children) \
    .options(contains_eager(MasterProduct.children, alias=mp_child))
However, when I use a joinedload, joinedload_all, or subqueryload, all
14 children are eagerly loaded, i.e.:
q = db.session.query(MasterProduct).filter_by(id=6752) \
.options(joinedload_all('children.store_products', innerjoin=True))
So the problem seems to be populating MasterProduct.children from the
explicit join using contains_eager.
Can anyone spot the error in my ways or help point me in the right direction?
OK, what you might observe in the SQL is that there's a "LIMIT 1" coming out. That's because you're using first(). We can just compare the first two queries, the contains_eager and the joinedload:
join() + contains_eager():
SELECT master_products_1.id AS master_products_1_id, master_products_1.pid AS master_products_1_pid, master_products.id AS master_products_id, master_products.pid AS master_products_pid
FROM master_products JOIN master_products AS master_products_1 ON master_products.id = master_products_1.pid
WHERE master_products.id = ?
LIMIT ? OFFSET ?
joinedload():
SELECT anon_1.master_products_id AS anon_1_master_products_id, anon_1.master_products_pid AS anon_1_master_products_pid, master_products_1.id AS master_products_1_id, master_products_1.pid AS master_products_1_pid
FROM (SELECT master_products.id AS master_products_id, master_products.pid AS master_products_pid
FROM master_products
WHERE master_products.id = ?
LIMIT ? OFFSET ?) AS anon_1 JOIN master_products AS master_products_1 ON anon_1.master_products_id = master_products_1.pid
You can see the second query is quite different; because first() means a LIMIT is applied, joinedload() knows to wrap the "criteria" query in a subquery, apply the LIMIT to that, and then apply the JOIN afterwards. In the join + contains_eager case, the LIMIT is applied to the collection itself and you get the wrong number of rows.
Just changing the script at the bottom to this:
for q, query_label in queries:
    mp_parent = q.all()[0]
I get the output it says you're expecting:
[explicit join with contains_eager] children=3, store_products=27
[joinedload] children=3, store_products=27
[joinedload_all] children=3, store_products=27
[subqueryload] children=3, store_products=27
[subqueryload_all] children=3, store_products=27
[explicit joins with contains_eager, filtered by left-join] children=3, store_products=9
(this is why getting a user-created example is so important)