I opened this new question because I'm not sure the request and the wording matched each other in this earlier one: pandas left join where right is null on multiple columns
What is the equivalent pandas code to this SQL? Contextually we're finding entries from a column in table_y that aren't in table_x with respect to several columns.
SELECT
table_x.column,
table_x.column2,
table_x.column3,
table_y.column,
table_y.column2,
table_y.column3
FROM table_x
LEFT JOIN table_y
ON table_x.column = table_y.column
AND table_x.column2 = table_y.column2
WHERE
table_y.column2 is NULL
Is this it?
columns_join = ['column', 'column2']
data_y = data_y.set_index(columns_join)
data_x = data_x.set_index(columns_join)
data_diff = pandas.concat([data_x, data_y]).drop_duplicates(keep=False) # any row not in both
# Select the diff representative from each dataset - in case datasets are too large
x1 = data_x[data_x.index.isin(data_diff.index)]
x2 = data_y[data_y.index.isin(data_diff.index)]
# Perform an outer join with the joined indices from each set,
# then remove the entries only contributed from table_x
data_compare = x1.merge(x2, how = 'outer', indicator=True, left_index=True, right_index=True)
data_compare_final = (
data_compare
.query('_merge == "left_only"')
.drop('_merge', axis=1)
)
I don't think that's equivalent because we only removed entries from table_x that aren't in the join based on multiple columns. I think we have to continue and compare the column against table_y.
data_compare = data_compare.reset_index().set_index('column2')
data_y = data_y.reset_index().set_index('column2')
mask_column2 = data_y.index.isin(data_compare.index)
result = data_y[~mask_column2]
Without test data it is a bit difficult to be sure that this helps, but you can try:
# Only if columns to join on in the right dataframe have the same name as columns in left
table_y[['col_join_1', 'col_join_2']] = table_y[['column', 'column2']] # Else this is not needed
# Merge left (LEFT JOIN)
table_merged = table_x.merge(
table_y,
how='left',
left_on=['column', 'column2'],
right_on=['col_join_1', 'col_join_2'],
suffixes=['_x', '_y']
)
# Filter dataframe
table_merged = table_merged.loc[
table_merged.column2_y.isna(),
['column_x', 'column2_x', 'column3_x', 'column_y', 'column2_y', 'column3_y']
]
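To check the left-join-where-null pattern on toy data, here is a minimal sketch. The two frames and their values are made up for illustration. It uses `indicator=True` rather than testing a right-hand column for NaN, so the key columns do not need to be copied first.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
table_x = pd.DataFrame({
    "column":  [1, 2, 3],
    "column2": ["a", "b", "c"],
    "column3": [10, 20, 30],
})
table_y = pd.DataFrame({
    "column":  [1, 3],
    "column2": ["a", "z"],
    "column3": [100, 300],
})

# LEFT JOIN on both key columns; _merge records where each row came from.
merged = table_x.merge(
    table_y,
    how="left",
    on=["column", "column2"],
    suffixes=("_x", "_y"),
    indicator=True,
)

# WHERE table_y.column2 IS NULL: keep rows that found no match in table_y.
only_in_x = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_x)
```

Only the key pair (1, "a") exists in both frames, so the rows for (2, "b") and (3, "c") remain, with NaN in column3_y.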
I found an equivalent that amounts to setting the index to the join column(s), concatenating the tables, dropping the duplicates, and performing an outer join between each table's contribution to the result. From there, one can select
left_only for this equivalent SQL
SELECT
table_x.*,
table_y.*
FROM table_x
LEFT JOIN table_y
ON table_x.column = table_y.column
AND table_x.column2 = table_y.column2
WHERE
table_y.column2 is NULL
right_only for this equivalent SQL
SELECT
table_x.*,
table_y.*
FROM table_y
LEFT JOIN table_x
ON table_y.column = table_x.column
AND table_y.column2 = table_x.column2
WHERE
table_x.column2 is NULL
import pandas

def create_dataframe_joined_diffs(dataframe_prod, dataframe_new, columns_join):
    """
    Set the indices to columns_join
    Concat the dataframes and remove duplicates
    Select the diff representative from each dataset
    Perform an outer join on the indices
    Pseudo-SQL:
    SELECT
    UNIQUE(*)
    FROM dataframe_prod
    OUTER JOIN dataframe_new
    ON columns_join
    """
    data_new = dataframe_new.set_index(columns_join)
    data_prod = dataframe_prod.set_index(columns_join)
    # Get any row not in both (may be removing too many)
    data_diff = pandas.concat([data_prod, data_new]).drop_duplicates(keep=False)
    # Select the diff representative from each dataset
    x1 = data_prod[data_prod.index.isin(data_diff.index)]
    x2 = data_new[data_new.index.isin(data_diff.index)]
    # Perform an outer join and keep the joined indices from each set
    # Sort the columns to make them easier to compare
    data_compare = x1.merge(x2, how='outer', indicator=True, left_index=True, right_index=True).sort_index(axis=1)
    return data_compare

# dataframe_prod, dataframe_new and columns_join are assumed to be defined by the caller
dataframe_compare = create_dataframe_joined_diffs(dataframe_prod, dataframe_new, columns_join)
mask_left = dataframe_compare['_merge'] == 'left_only'
mask_right = dataframe_compare['_merge'] == 'right_only'
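The two masks can be tried on a small example. This is a simplified sketch rather than the function above: it skips the concat/drop_duplicates pre-filter and merges on the key columns instead of the index. The frame names and values are hypothetical.

```python
import pandas as pd

# Hypothetical inputs standing in for the production and new tables.
prod = pd.DataFrame({"column": [1, 2], "column2": ["a", "b"], "value": [10, 20]})
new = pd.DataFrame({"column": [2, 3], "column2": ["b", "c"], "value": [20, 30]})

# A single outer merge on the key columns tags every row with its origin.
compare = prod.merge(
    new,
    how="outer",
    on=["column", "column2"],
    suffixes=("_prod", "_new"),
    indicator=True,
)

left_only = compare[compare["_merge"] == "left_only"]    # keys only in prod
right_only = compare[compare["_merge"] == "right_only"]  # keys only in new
print(left_only)
print(right_only)
```

One outer merge answers both directions of the question, so there is no need to swap the tables and merge twice.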
Related
Some books have more than one author. I have a table with book_id and author_id1, author_id2, author_id3, and author_id4, and another table with author_id and author_name.
How can I join these two tables and the main table on book_id to get the authors' names together in one row of a SQL query result?
Example:
SELECT book.book_id, book.title, author.author, book.location
FROM books AS b JOIN book_authors AS ba ON b.book_id = ba.book_id JOIN authors AS a ON REGEX ba.authors_id$ = a.authors_id
I'm not sure about the REGEX ($) use in SQL. The result should display id, title, authors, location.
How do I get all authors_id# to match authors_id ( notice one has number at end other does not)?
update: So, I would like to get book_authors.authors_id1 to match authors.authors_id, book_authors.authors_id2 to match authors.authors_id, book_authors.authors_id3 to match authors.authors_id, book_authors.authors_id4 to match authors.authors_id and return all the matching authors in list.
...
# merge book_authors and authors into one dataframe
ba_df.rename(columns= {'authors_id1': 'authors_id'}, inplace=True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns = {'authors_id':'authors_name1', 'authors_id2': 'authors_id'}, inplace = True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns = {'authors_id':'authors_name2', 'authors_id3': 'authors_id'}, inplace = True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns = {'authors_id':'authors_name3', 'authors_id4': 'authors_id'}, inplace = True)
ba_df['authors_id'] = ba_df['authors_id'].map(a_df.set_index('authors_id')['authors_name'])
ba_df.rename(columns = {'authors_id':'authors_name4'}, inplace = True)
...
While working through another dataframe, I got the idea to rename a column so that the key matches on both dataframes and then use map with set_index. The map lines now work; the common column just has to be renamed after each step so it is not overwritten. In this case authors_id was replaced with authors_name1, 2, 3 and 4, which correspond to authors_id1, 2, 3 and 4. And yes, it is not pure SQL, but it works in Python, which is where I had the problem.
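The same rename-and-map idea can be written as a loop that leaves the id columns in place. This is a sketch on made-up data; the sample names and the combined `authors` column are assumptions for illustration, not part of the original tables.

```python
import pandas as pd

nan = float("nan")

# Hypothetical sample data shaped like the tables in the question.
a_df = pd.DataFrame({"authors_id": [1, 2, 3],
                     "authors_name": ["Ann", "Bob", "Cy"]})
ba_df = pd.DataFrame({"book_id": [10, 11],
                      "authors_id1": [1, 3],
                      "authors_id2": [2, nan],
                      "authors_id3": [nan, nan],
                      "authors_id4": [nan, nan]})

# Build the id -> name lookup once instead of on every map call.
lookup = a_df.set_index("authors_id")["authors_name"]

for i in range(1, 5):
    # Missing or unmatched ids become NaN in the name column.
    ba_df[f"authors_name{i}"] = ba_df[f"authors_id{i}"].map(lookup)

# Collapse the four name columns into one comma-separated string per book.
name_cols = [f"authors_name{i}" for i in range(1, 5)]
ba_df["authors"] = ba_df[name_cols].apply(
    lambda row: ", ".join(row.dropna()), axis=1)
print(ba_df[["book_id", "authors"]])
```

Writing to new authors_nameN columns avoids the chain of renames, and the original ids stay available for later joins.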
I have this db structure
and this is my joined tables (sample data)
I want to filter for cars where (key = 'price' and value > 4000) and (key = 'top-speed' and value > 200).
Thanks for the help.
Try this:
SELECT
c.id,
c.name,
s.key,
csv.value
FROM
car c
LEFT JOIN car_specification_value csv ON c.id = csv.car_id
LEFT JOIN specification s ON csv.specification_id = s.id
WHERE
s.key = 'price'
AND csv.value > 4000
AND s.key = 'top-speed'
AND csv.value > 200;
A common practice is giving aliases to your tables so the join conditions can be determined in a shorter way.
One method uses aggregation:
select csv.car_id
from car_specification_value csv
where (csv.key = 'price' and (csv.value + 0) > 4000) or
(csv.key = 'top-speed' and (csv.value + 0) > 200)
group by csv.car_id
having count(distinct csv.key) = 2; -- both match
Note the + 0. This uses implicit conversion to change the value to a number, so it can be properly compared to a number. One challenge of key/value data structures is that all the values are strings, and that is tricky for other data types.
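The aggregation idea can be tried with Python's built-in sqlite3 module. This is a sketch under assumptions: the flattened table and its rows are hypothetical, and SQLite stands in for the asker's database. Because a single row holds only one key, the two conditions are combined with OR and the HAVING clause then requires that both keys matched.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical flattened key/value table; values are stored as text.
con.execute(
    "CREATE TABLE car_specification_value (car_id INTEGER, key TEXT, value TEXT)"
)
con.executemany(
    "INSERT INTO car_specification_value VALUES (?, ?, ?)",
    [
        (1, "price", "5000"), (1, "top-speed", "220"),  # passes both filters
        (2, "price", "9000"), (2, "top-speed", "180"),  # too slow
        (3, "price", "3000"), (3, "top-speed", "250"),  # too cheap
    ],
)

rows = con.execute("""
    SELECT car_id
    FROM car_specification_value
    WHERE (key = 'price' AND (value + 0) > 4000)
       OR (key = 'top-speed' AND (value + 0) > 200)
    GROUP BY car_id
    HAVING COUNT(DISTINCT key) = 2
""").fetchall()
matching_ids = [r[0] for r in rows]
print(matching_ids)
```

Only car 1 satisfies both conditions; cars 2 and 3 each match one key and are dropped by the HAVING clause.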
I have a query as follows:
SELECT *
FROM tb_circulares LEFT JOIN tb_colegios ON tb_circulares.colegio_circular = tb_colegios.id_colegio
LEFT JOIN tb_circulares_clase ON tb_circulares.codigo_circular = tb_circulares_clase.circular
LEFT JOIN tb_clases ON tb_circulares_clase.clase = tb_clases.id_clase
WHERE colegio_circular = 17
The query output shows three rows.
One of the row columns is the value for the field tb_circular_clase.nombre_clase
I would like to get a string containing the three resulting values for tb_circular_clase.nombre_clase.
For example:
row1-> nombre_clase = "1º primaria"
row2-> nombre_clase = "3ª secundaria"
row3-> nombre_clase = "4º primaria"
Is it possible to get a resulting query field with the final value?
resultado = "1º primaria - 3ª secundaria - 4º primaria"
Thanks
Sounds like you're after group_concat(), which is an aggregation function concatenating all values of a group.
SELECT group_concat(nombre_clase SEPARATOR ' - ') resultado
FROM tb_circulares
LEFT JOIN tb_colegios
ON tb_circulares.colegio_circular = tb_colegios.id_colegio
LEFT JOIN tb_circulares_clase
ON tb_circulares.codigo_circular = tb_circulares_clase.circular
LEFT JOIN tb_clases
ON tb_circulares_clase.clase = tb_clases.id_clase
WHERE colegio_circular = 17;
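To see the aggregate in action without a MySQL server, here is a small sketch using Python's built-in sqlite3 module. The single table and its rows are hypothetical, and SQLite takes the separator as a second argument where MySQL uses the SEPARATOR keyword.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical stand-in for the joined result: one row per class name.
con.execute("CREATE TABLE tb_clases (id_clase INTEGER, nombre_clase TEXT)")
con.executemany(
    "INSERT INTO tb_clases VALUES (?, ?)",
    [(1, "1º primaria"), (2, "3ª secundaria"), (3, "4º primaria")],
)

# SQLite spelling: group_concat(expr, separator).
(resultado,) = con.execute(
    "SELECT group_concat(nombre_clase, ' - ') FROM tb_clases"
).fetchone()
print(resultado)
```

The three rows collapse into a single string. Neither database guarantees the order of the pieces unless an ordering is specified inside the aggregate.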
I'm facing a problem and I'm not finding the answer. I'm querying a MySql table during my java process and I would like to exclude some rows from the return of my query.
Here is the query:
SELECT
o.offer_id,
o.external_cat,
o.cat,
o.shop,
o.item_id,
oa.value
FROM
offer AS o,
offerattributes AS oa
WHERE
o.offer_id = oa.offer_id
AND (cat = 1200000 OR cat = 12050200
OR cat = 13020304
OR cat = 3041400
OR cat = 3041402)
AND (oa.attribute_id = 'status_live_unattached_pregen'
OR oa.attribute_id = 'status_live_attached_pregen'
OR oa.attribute_id = 'status_dead_offer_getter'
OR oa.attribute_id = 'most_recent_status')
AND (oa.value = 'OK'
OR oa.value='status_live_unattached_pregen'
OR oa.value='status_live_attached_pregen'
OR oa.value='status_dead_offer_getter')
The trick here is that I need the value to be 'OK' in order to continue my process, but I don't need MySQL to return it in its response; I only need the other values to be returned. At the moment it's returning two rows per query, one with the 'OK' value and another with one of the other values.
I would like the return value to be like this:
'000005261383370', '10020578', '1200000', '562', '1000000_157795705', 'status_live_attached_pregen'
for my query, but it returns:
'000005261383370', '10020578', '1200000', '562', '1000000_157795705', 'OK'
'000005261383370', '10020578', '1200000', '562', '1000000_157795705', 'status_live_attached_pregen'
Some help would really be appreciated.
Thank you !
I think you can solve this with a self INNER JOIN:
SELECT o.offer_id
,o.external_cat
,o.cat
,o.shop
,o.item_id
,oa.value
FROM offer AS o
INNER JOIN offerattributes AS oa
ON o.offer_id = oa.offer_id
INNER JOIN offerattributes AS oaOK
ON oaOK.offer_id = oa.offer_id
AND oaOK.value = 'OK'
WHERE o.cat IN (1200000,12050200,13020304,3041400,3041402)
AND oa.attribute_id IN ('status_live_unattached_pregen','status_live_attached_pregen','status_dead_offer_getter','most_recent_status')
AND oa.value IN ('status_live_unattached_pregen','status_live_attached_pregen','status_dead_offer_getter');
By doing a self-JOIN with the restriction of value OK, it will limit the result set to offer_ids that have an OK response, but the WHERE clause will still retrieve the values you need. Based on your description, I think this is what you were looking for.
I also converted your implicit cross JOIN to an explicit INNER JOIN and changed your ORs to IN, which is easier to read and should perform at least as well.
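The effect of the self-join can be checked on a reduced example with Python's built-in sqlite3 module. This is a sketch under assumptions: only the offerattributes table is modelled, the rows are made up, and SQLite stands in for MySQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE offerattributes (offer_id TEXT, attribute_id TEXT, value TEXT)"
)
con.executemany(
    "INSERT INTO offerattributes VALUES (?, ?, ?)",
    [
        ("A", "status_live_attached_pregen", "OK"),
        ("A", "most_recent_status", "status_live_attached_pregen"),
        ("B", "most_recent_status", "status_dead_offer_getter"),  # no OK row
    ],
)

# oaOK only acts as an existence check; oa supplies the value we return.
rows = con.execute("""
    SELECT oa.offer_id, oa.value
    FROM offerattributes AS oa
    INNER JOIN offerattributes AS oaOK
        ON oaOK.offer_id = oa.offer_id
       AND oaOK.value = 'OK'
    WHERE oa.value <> 'OK'
""").fetchall()
print(rows)
```

Offer A is returned once with its non-OK value, and offer B is excluded because it has no 'OK' row to join against.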
Given the following relationships:
- 1 MasterProduct parent -> many MasterProduct children
- 1 MasterProduct child -> many StoreProducts
- 1 StoreProduct -> 1 Store
I have defined the following declarative models in SQLAlchemy:
class MasterProduct(Base):
    __tablename__ = 'master_products'
    id = Column(Integer, primary_key=True)
    pid = Column(Integer, ForeignKey('master_products.id'))
    children = relationship('MasterProduct', join_depth=1,
                            backref=backref('parent', remote_side=[id]))
    store_products = relationship('StoreProduct', backref='master_product')

class StoreProduct(Base):
    __tablename__ = 'store_products'
    id = Column(Integer, primary_key=True)
    mid = Column(Integer, ForeignKey('master_products.id'))
    sid = Column(Integer, ForeignKey('stores.id'))
    timestamp = Column(DateTime)
    store = relationship('Store', uselist=False)

class Store(Base):
    __tablename__ = 'stores'
    id = Column(Integer, primary_key=True)
My goal is to replicate the following query in SQLAlchemy with eager loading:
SELECT *
FROM master_products mp_parent
INNER JOIN master_products mp_child ON mp_child.pid = mp_parent.id
INNER JOIN store_products sp1 ON sp1.mid = mp_child.id
LEFT JOIN store_products sp2
ON sp1.mid = sp2.mid AND sp1.sid = sp2.sid AND sp1.timestamp < sp2.timestamp
WHERE mp_parent.id = 6752 AND sp2.id IS NULL
The query selects all MasterProduct children for parent 6752 and all
corresponding store products grouped by most recent timestamp using a NULL
self-join (greatest-n-per-group). There are 82 store products returned from the
query, with 14 master product children.
I've tried the following to no avail:
mp_child = aliased(MasterProduct)
sp1 = aliased(StoreProduct)
sp2 = aliased(StoreProduct)
q = db.session.query(MasterProduct).filter_by(id=6752) \
.join(mp_child, MasterProduct.children) \
.join(sp1, mp_child.store_products) \
.outerjoin(sp2, and_(sp1.mid == sp2.mid, sp1.sid == sp2.sid, sp1.timestamp < sp2.timestamp)) \
.filter(sp2.id == None) \
.options(contains_eager(MasterProduct.children, alias=mp_child),
contains_eager(MasterProduct.children, mp_child.store_products, alias=sp1))
>>> mp_parent = q.first() # the query below looks ok!
SELECT <all columns from master_products, master_products_1, and store_products_1>
FROM master_products INNER JOIN master_products AS master_products_1 ON master_products.id = master_products_1.pid INNER JOIN store_products AS store_products_1 ON master_products_1.id = store_products_1.mid LEFT OUTER JOIN store_products AS store_products_2 ON store_products_1.mid = store_products_2.mid AND store_products_1.sid = store_products_2.sid AND store_products_1.timestamp < store_products_2.timestamp
WHERE master_products.id = %s AND store_products_2.id IS NULL
LIMIT %s
>>> mp_parent.children # only *one* child is eagerly loaded (expected 14)
[<app.models.MasterProduct object at 0x2463850>]
>>> mp_parent.children[0].id # this is correct, 6762 is one of the children
6762L
>>> mp_parent.children[0].pid # this is correct
6752L
>>> mp_parent.children[0].store_products # only *one* store product is eagerly loaded (expected 7 for this child)
[<app.models.StoreProduct object at 0x24543d0>]
Taking a step back and simplifying the query to eagerly load just the children
also results in only 1 child being eagerly loaded instead of all 14:
mp_child = aliased(MasterProduct)
q = db.session.query(MasterProduct).filter_by(id=6752) \
.join(mp_child, MasterProduct.children) \
.options(contains_eager(MasterProduct.children, alias=mp_child))
However, when I use a joinedload, joinedload_all, or subqueryload, all
14 children are eagerly loaded, i.e.:
q = db.session.query(MasterProduct).filter_by(id=6752) \
.options(joinedload_all('children.store_products', innerjoin=True))
So the problem seems to be populating MasterProduct.children from the
explicit join using contains_eager.
Can anyone spot the error in my ways or help point me in the right direction?
OK what you might observe in the SQL is that there's a "LIMIT 1" coming out. That's because you're using first(). We can just compare the first two queries, the contains eager, and the joinedload:
join() + contains_eager():
SELECT master_products_1.id AS master_products_1_id, master_products_1.pid AS master_products_1_pid, master_products.id AS master_products_id, master_products.pid AS master_products_pid
FROM master_products JOIN master_products AS master_products_1 ON master_products.id = master_products_1.pid
WHERE master_products.id = ?
LIMIT ? OFFSET ?
joinedload():
SELECT anon_1.master_products_id AS anon_1_master_products_id, anon_1.master_products_pid AS anon_1_master_products_pid, master_products_1.id AS master_products_1_id, master_products_1.pid AS master_products_1_pid
FROM (SELECT master_products.id AS master_products_id, master_products.pid AS master_products_pid
FROM master_products
WHERE master_products.id = ?
LIMIT ? OFFSET ?) AS anon_1 JOIN master_products AS master_products_1 ON anon_1.master_products_id = master_products_1.pid
You can see the second query is quite different: because first() means a LIMIT is applied, joinedload() knows to wrap the "criteria" query in a subquery, apply the limit to that, then apply the JOIN afterwards. In the join+contains_eager case, the LIMIT is applied to the joined rows themselves, so the collection is built from too few rows.
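The difference between the two LIMIT placements can be reproduced in plain SQL, without SQLAlchemy. The sketch below uses Python's built-in sqlite3 module with a made-up parent and three children; it illustrates the row counts only, not the SQL that SQLAlchemy emits.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE master_products (id INTEGER PRIMARY KEY, pid INTEGER)")
# One parent (id 1) with three children (ids 2, 3, 4).
con.executemany(
    "INSERT INTO master_products VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 1)],
)

# LIMIT applied to the joined rows: only one parent/child pair survives.
limited_join = con.execute("""
    SELECT p.id, c.id
    FROM master_products AS p
    JOIN master_products AS c ON c.pid = p.id
    WHERE p.id = 1
    LIMIT 1
""").fetchall()

# LIMIT applied to the parent in a subquery, then joined: all children survive.
limited_parent = con.execute("""
    SELECT p.id, c.id
    FROM (SELECT id FROM master_products WHERE id = 1 LIMIT 1) AS p
    JOIN master_products AS c ON c.pid = p.id
""").fetchall()

print(len(limited_join), len(limited_parent))
```

The first query yields a single row and the second yields three, which mirrors one child being loaded with first() plus contains_eager versus the full collection with joinedload.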
Just changing the script at the bottom to this:
for q, query_label in queries:
    mp_parent = q.all()[0]
I get the output it says you're expecting:
[explicit join with contains_eager] children=3, store_products=27
[joinedload] children=3, store_products=27
[joinedload_all] children=3, store_products=27
[subqueryload] children=3, store_products=27
[subqueryload_all] children=3, store_products=27
[explicit joins with contains_eager, filtered by left-join] children=3, store_products=9
(this is why getting a user-created example is so important)