I have a DB with various tables. I see myself writing various function of this form :
def getAgeAndHealthFromUser()
def getIDPriceLifeFromProduct()
def getIDFromUser()
def getPriceFromProduct()
def setIDFromUser()
def setPriceFromProduct()
.. and so on.
Basically I am selecting / setting multiple columns of different tables most of the time. I hope you get it.
That is when I tried the generic function approach which takes different column names, table name as input and does the work.
I want to know does this approach has any potential problems I might get into ? Is this the right thing to do design wise ?
Its doable , however you need to considered in such API the below :
For Set API - Each table has different column and different data types .
For Get API - when you selecting data you select columns ( your result set structure is different per query case ) , however your query filter is different from use case to use case .
You can consider interact with the API using some kind of generic Json interface or something else , but at end you may pay for this extra generic API in performances .
Thanks
Related
I'm relatively new to AWS Glue and using the visual AWS Glue studio at the moment. Kind of a niche issue I'm having here...
Context:
I'm building an ETL job that, among other things, should parse/flatten json from a string column to replace it with different fields in different format which I can select to load in my datawarehouse table.
Approach:
I first extract my data from the Glue catalog as a dynamicFrame (in this case only one table).
Then I'm trying to use the approach of unboxing and unnesting.
Let's call that json column data:
def transformTable (glueContext, dfc) -> DynamicFrameCollection:
dyf = dfc.select(list(dfc.keys())[0])
dyf = Unbox.apply(frame=dyf, path="data", format="json")
dyf = UnnestFrame.apply(frame=dyf)
return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)
(Then I have a step to select the right frame from the frame collection, and then I can apply mapping to my fields and load.)
My issue:
Glue automatically infers the data types of the my frame schema (rather successfully)
but it duplicates certain fields into several when the data type is unclear (similar to make_cols in the resolveChoice method), e.g. I end up with 2 fields in the output schema price_int and price_double, where price_int contains only the values that were round numbers by chance and null values everywhere else, etc.
So it seems like the default behavior of this method is to split columns in case of data type doubt (make_cols).
I understand that I could write a resolveChoice for each field, but with this approach they are already split in separate columns in the output schema.
Note: There are dozens of fields in this json, so I'm trying to devise a blanket solution that automatically makes all the fields of the json available in the schema to select and map in the next step, and avoid having to add one line of code for each field I want to extract. (And the json structure will grow with new fields in the future, so I'm trying to limit future ETL maintenance...)
Questions/help needed:
Any idea if there's a way to change this default behavior (like in the resolveChoice method)?
Alternatively, is there a way to apply a kind of default resolveChoice to all problematic fields from the json unboxing? For instance, I could force all problematic fields into string (similar to 'project:string'), and then reformat if needed in the applyMapping step. But resolveChoice seems to need to be applied field by field...
What's a different/better approach I could try? I would like to keep it as dynamic/automated as possible... e.g.:
I think I could maybe extract specific fields from the JSON line by line, but I'm not sure how (looks like the Unbox method is already splitting columns by format). And as explained, it's dozens of fields and growing... so it requires updating the code regularly, instead of just ticking boxes in the list of available fields.
TheRelationalize method could be an option, but it creates distinct frames and this quickly becomes much more complex (there are actually several columns with json, which all need to be flattened...).
Creating crawlers or classifiers which are run automatically regularly for extracting the schema from that specific string column from a table should be an option as well...
Thanks in advance!
I have two tables, one called Company and the other called User, each user is related to one company using ForeignKey. So I can use reverse relation in Django to get all users for specific company (e.g. company.users)
In my case, I'm building ListAPIView which return multiple companies, and I'd like to return latest created user. My problem is that I don't want to use prefetch_related or select_related so it will load all the users, as we might end up having thousands of users per each company! Also I don't want to load each latest user in a separate query so we end up having tens of queries per API request!
I've tried something like this:
users_qs = models.User.objects.filter(active=True).order_by('-created')
company_qs = models.Company.objects.prefetch_related(
Prefetch('users', queryset=users_qs[:1], to_attr='user')
).order_by('-created')
In this case, prefetch_related failed as we can't set limit on the Prefetch's queryset filter (it gives this error "Cannot filter a query once a slice has been taken.")
Any ideas?
I think you are providing an object instead of a queryset Prefetch('users', queryset=users_qs[:1], to_attr='user')
I have a design problem with SQL request:
I need to return data looking like:
listChannels:
-idChannel
name
listItems:
-data
-data
-idChannel
name
listItems:
-data
-data
The solution I have now is to send a first request:
*"SELECT * FROM Channel WHERE idUser = ..."*
and then in the loop fetching the result, I send for each raw another request to feel the nested list:
"SELECT data FROM Item WHERE idChannel = ..."
It's going to kill the app and obviously not the way to go.
I know how to use the join keyword, but it's not exactly what I want as it would return a row for each data of each listChannels with all the information of the channels.
How to solve this common problem in a clean and efficient way ?
The "SQL" way of doing this produces of table with columns idchannel, channelname, and the columns for item.
select c.idchannel, c.channelname, i.data
from channel c join
item i
on c.idchannel = i.idchannel
order by c.idchannel, i.item;
Remember that a SQL query returns a result set in the form of a table. That means that all the rows have the same columns. If you want a list of columns, then you can do an aggregation and put the items in a list:
select c.idchannel, c.channelname, group_concat(i.data) as items
from channel c join
item i
on c.idchannel = i.idchannel
group by c.idchannel, c.channelname;
The above uses MySQL syntax, but most databases support similar functionality.
SQL is made for accessing two-dimensional data tables. (There are more possibilities, but they are very complex and maybe not standardized)
So the best way to solve your problem is to use multiple requests. Please also consider using transactions, if possible.
I am struggling with a basic problem. i am using cake php 2.5. i try to apply the find query in the company model and receiving all the data from companies and with its associations, but i only want to receive the data from company table and want to exclude the data from rest of relationships, can anyone help me with this. below are my queries.
$this->loadModel('Company');
$fields=array('id','name','logo','status');
$conditions=array('status'=>1);
$search_companies = $this->Company->find('first',
compact(array('conditions'=>$conditions,'fields'=>$fields)));
print_r($search_companies);die();
echo json_encode($search_companies);die();
With out seeing your data output, I am just going to take a stab at the problem.
Inside your $search_companies variable you are getting a multidimensional array probably with the other values of the other tables.
Why not just select the one array:
$wantedData = $search_companies['Company'];
// The key Company (which is the model) should be the data you are wanting.
Try setting model's recursive value to -1
$this->Company->recursive = -1;
$search_companies = $this->Company->find('first',
compact(array('conditions'=>$conditions,'fields'=>$fields)));
With this you will not fire the joins queries and therefore you only retrieve model's information.
Cakephp provide this functionality that we can unblind few/all associations on a any model. the keyword unbindModel is used for this purpose. inside the unblindModel you can define the association type and model(s) name that you want to unblind for that specific association.
$this->CurrentModelName->unbindModel(array('AssociationName' => array('ModelName_Youwwant_unblind')));
I'm learning sqlalchemy and not sure if I grasp it fully yet(I'm more used to writing queries by hand but I like the idea of abstracting the queries and getting objects). I'm going through the tutorial and trying to apply it to my code and ran into this part when defining a model:
def __repr__(self):
return "<User('%s','%s', '%s')>" % (self.name, self.fullname, self.password)
Its useful because I can just search for a username and get only the info about the user that I want but is there a way to either have multiple of these type of views that I can call? or am I using it wrong and should be writing a specific query for getting different data for different views?
Some context to why I'm asking my site has different templates, and most pages will just need the usersname, first/last name but some pages will require things like twitter or Facebook urls(also fields in the model).
First of all, __repr__ is not a view, so if you have a simple model User with defined columns, and you query for a User, all the columns will get loaded from the database, and not only those used in __repr__.
Lets take model Book (from the example refered to later) as a basis:
class Book(Base):
book_id = Column(Integer, primary_key=True)
title = Column(String(200), nullable=False)
summary = Column(String(2000))
excerpt = Column(Text)
photo = Column(Binary)
The first option to skip loading some columns is to use Deferred Column Loading:
class Book(Base):
# ...
excerpt = deferred(Column(Text))
photo = deferred(Column(Binary))
In this case when you execute query session.query(Book).get(1), the photo and excerpt columns will not be loaded until accessed from the code, at which point another query against the database will be executed to load the missing data.
But if you know before you query for the Book that you need the column photo immediately, you can still override the deferred behavior with undefer option: query = session.query(Book).options(undefer('photo')).get(1).
Basically, the suggestion here is to defer all the columns (in your case: except username, password etc) and in each use case (view) override with undefer those you know you need for that particular view. Please also see the group parameter of deferred, so that you can group the attributes by use case (view).
Another way would be to query only some columns, but in this case you are getting the tuple instance instead of the model instance (in your case User), so it is potentially OK for form filling, but not so good for model validation: session.query(Book.id, Book.title).all()