I currently have this class for scraping products from a single retailer website using Nokogiri. The XPath and CSS path details are stored in MySQL.
ActiveRecord::Base.establish_connection(
  :adapter => "mysql2",
  ...
)

class Site < ActiveRecord::Base
  has_many :site_details

  def create_product_links
    # http://www.example.com
    p = Nokogiri::HTML(open(url))
    p.xpath(total_products_path).each { |lnk| SiteDetail.find_or_create_by(url: url + "/" + lnk['href'], site_id: self.id) }
  end
end

class SiteDetail < ActiveRecord::Base
  belongs_to :site

  def get_product_data
    # http://www.example.com
    p = Nokogiri::HTML(open(url))
    title = p.css(site.title_path).text
    price = p.css(site.price_path).text
    description = p.css(site.description_path).text
    update_attributes!(title: title, price: price, description: description)
  end
end
# Execution
#s = Site.first
#s.site_details.get_product_data
I will be adding more sites (around 700) in the future. Each site has a different page structure, so the get_product_data method cannot be used as is. I may have to use case or if statements to jump to and execute the relevant code. This class would soon become quite chunky and ugly (700 retailers).
What is the best design approach suitable in this scenario?
Like @James Woodward said, you're going to want to create a class for each retailer. The pattern I'm going to post has three parts:
A couple of ActiveRecord classes that implement a common interface for storing the data you want to record from each site
700 different classes, one for each site you want to scrape. These classes implement the algorithms for scraping the sites, but don't know how to store the information in the database. To do that, they rely on the common interface from step 1.
One final piece of code that ties it all together, running each of the scraping algorithms you wrote in step 2.
Step 1: ActiveRecord Interface
This step is pretty easy. You already have Site and SiteDetail classes. You can keep them for storing the data you scrape from websites in your database.
You told the Site and SiteDetail classes how to scrape data from websites. I would argue this is inappropriate. Now you've given the classes two responsibilities:
Persist data in the database
Scrape data from the websites
We'll create new classes to handle the scraping responsibility in the second step. For now, you can strip down the Site and SiteDetail classes so that they only act as database records:
class Site < ActiveRecord::Base
  has_many :site_details
end

class SiteDetail < ActiveRecord::Base
  belongs_to :site
end
Step 2: Implement Scrapers
Now, we'll create new classes that handle the scraping responsibility. If this were a language that supported abstract classes or interfaces like Java or C#, we would proceed like so:
Create an IScraper or AbstractScraper interface that handles the tasks common to scraping a website.
Implement a different FooScraper class for each of the sites you want to scrape, each one inheriting from AbstractScraper or implementing IScraper.
Ruby doesn't have abstract classes, though. What it does have is duck typing and module mix-ins. This means we'll use this very similar pattern:
Create a SiteScraper module that handles the tasks common to scraping a website. This module will assume that the classes that extend it have certain methods it can call.
Implement a different FooScraper class for each of the sites you want to scrape, each one mixing in the SiteScraper module and implementing the methods the module expects.
It looks like this:
module SiteScraper
  # Assumes that classes including the module
  # have get_product_urls and get_product_details methods
  #
  # The get_product_urls method should return a list
  # of the URLs to visit to get scraped data
  #
  # The get_product_details method takes the URL of the
  # product to scrape as a string and returns a SiteDetail
  # with data scraped from the given URL
  def get_data
    site = Site.new
    get_product_urls.each do |product_url|
      site_detail = get_product_details product_url
      site_detail.site = site
      site_detail.save
    end
  end
end
class ExampleScraper
  include SiteScraper

  def get_product_urls
    urls = []
    p = Nokogiri::HTML(open('http://www.example.com/products'))
    p.xpath('//products').each { |lnk| urls.push lnk }
    urls
  end

  def get_product_details(product_url)
    p = Nokogiri::HTML(open(product_url))
    title = p.css('//title').text
    price = p.css('//price').text
    description = p.css('//description').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end
class FooBarScraper
  include SiteScraper

  def get_product_urls
    urls = []
    p = Nokogiri::HTML(open('http://www.foobar.com/foobars'))
    p.xpath('//foo/bar').each { |lnk| urls.push lnk }
    urls
  end

  def get_product_details(product_url)
    p = Nokogiri::HTML(open(product_url))
    title = p.css('//foo').text
    price = p.css('//bar').text
    description = p.css('//foo/bar/iption').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end
... and so on, creating a class that mixes in SiteScraper and implements get_product_urls and get_product_details for each one of the 700 websites you need to scrape. Unfortunately, this is the tedious part of the pattern: there's no real way to get around writing a different scraping algorithm for all 700 sites.
Step 3: Run Each Scraper
The final step is to create the cron job that scrapes the sites.
every :day, at: '12:00am' do
  ExampleScraper.new.get_data
  FooBarScraper.new.get_data
  # + 698 more lines
end
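Listing 698 more scrapers by hand is easy to get wrong. A hedged alternative (a plain-Ruby sketch; the stub get_data methods stand in for the real scraping) is to have SiteScraper register every class that includes it, so the cron job becomes a single loop:

```ruby
# SiteScraper keeps a registry of every class that mixes it in.
module SiteScraper
  def self.included(base)
    registry << base
  end

  def self.registry
    @registry ||= []
  end
end

class ExampleScraper
  include SiteScraper
  def get_data; 'example data'; end # stub standing in for real scraping
end

class FooBarScraper
  include SiteScraper
  def get_data; 'foobar data'; end # stub standing in for real scraping
end

# The cron block shrinks to one loop, no matter how many scrapers exist:
results = SiteScraper.registry.map { |scraper| scraper.new.get_data }
# results => ["example data", "foobar data"]
```

With this in place, the whenever block only needs `SiteScraper.registry.each { |s| s.new.get_data }`, and adding site number 701 is just a matter of defining one more class.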
Related
I have two models in consideration: RV_Offers and RV_Details. Each offer can have multiple details, i.e. I have a ForeignKey relationship field in the RV_Details table.
Here is my view:
rv_offers_queryset = RV_Offers.objects.all().select_related('vendor').prefetch_related('details')
details_queryset = RV_Details.objects.all().select_related('rv_offer')

title = Subquery(details_queryset.filter(
    rv_offer=OuterRef("id"),
).order_by("-created_at").values("original_title")[:1])

offers_queryset = rv_offers_queryset.annotate(title=title).filter(django_query)
offers = RVOffersSerializer(offers_queryset, many=True).data

return Response({'result': offers}, status=HTTP_200_OK)
As can be seen, I am passing the offers queryset to the serializer.
Now, here is my serializer:
class RVOffersSerializer(serializers.ModelSerializer):
    details = serializers.SerializerMethodField()
    vendor = VendorSerializer()

    def get_details(self, obj):
        queryset = RV_Details.objects.all().select_related('rv_offer')
        queryset = queryset.filter(rv_offer=obj.id).latest('created_at')
        return RVDetailsSerializer(queryset).data

    class Meta:
        model = RV_Offers
        fields = '__all__'
If you look at the get_details method, I am trying to fetch the latest detail that belongs to an offer. My problem is that even though I am using select_related to optimize the query, the results are still extremely slow. In fact, I am using django-debug-toolbar to inspect the queries, and apparently select_related seems to have no effect.
What am I doing wrong or how else can I approach this problem?
This is what I did to reduce the number of queries being hit on the db:
def get_details(self, obj):
    details = obj.details.last()
    return RVDetailsSerializer(details).data
I was able to reduce the number of queries from 45 to 4 using this.
This is because in the view I had already used prefetch_related to build the queryset, which in turn is being used here via obj.
I am trying to "merge" one record into another record with all of its child relationships.
For example:
I have vendor1 and vendor2 which both have many relations that contain other has_many. For example a vendor has many purchase_orders and a purchase order has many ordered_items and an ordered_item has many received_items.
If I change vendor2's name to be the same as vendor1's name, then I want to destroy vendor2 but move all of its has_many records to vendor1.
This is what I've been trying to do:
def vendor_merge(main_vendor, merge_vendor)
  relationships = [
    merge_vendor.returns, merge_vendor.receiving_and_bills,
    merge_vendor.bills, merge_vendor.purchase_orders, merge_vendor.taxes,
    Check.where(payee_id: merge_vendor.id, payee_type: "Vendor"),
    JournalEntryAccount.where(payee_id: merge_vendor.id)
  ]

  relationships.each do |relationship|
    class_name = relationship.class.name
    relationship.each do |r|
      if class_name === "Check"
        r.update(payee_id: main_vendor.id)
      else
        r.update(vendor_id: main_vendor.id)
      end
      r.save
    end
    relationship.delete_all
  end

  merge_vendor.destroy
end
Doing it this way gives me constraint errors because of the has_manys nested inside the has_manys, and because of the has_many through: relationships, etc.
Any straightforward solution to this?
You will need merge logic defined in your app. This could be a PORO (plain old Ruby object), like VendorMerger, which holds all the logic for merging one Vendor record into another (this could also live inside the Vendor model, but it would pollute your model).
Here is an example of that PORO:
# lib/vendor_merger.rb
class VendorMerger
  def initialize(vendor_from, vendor_to)
    @vendor_from = vendor_from
    @vendor_to = vendor_to
  end

  def perform!
    validate_before_merge!
    ActiveRecord::Base.transaction do # will roll back if an error is raised in this block
      migrate_related_records!
      destroy_after_merge!
    end
  end

  private

  def validate_before_merge!
    raise ArgumentError, 'Trying to merge the same record' if @vendor_from == @vendor_to
    raise ArgumentError, 'A vendor is not persisted' if @vendor_from.new_record? || @vendor_to.new_record?
    # ...
  end

  def migrate_related_records!
    # see my thought (1) below
    @vendor_from.purchases.each do |purchase|
      purchase.vendor = @vendor_to
      # ...
      purchase.save!
    end
  end

  def destroy_after_merge!
    @vendor_from.reload.destroy!
  end
end
Usage:
VendorMerger.new(Vendor.first, Vendor.last).perform!
This PORO allows you to contain all the logic related to the merge in one file. It respects the SRP (single responsibility principle) and makes testing very easy, as well as maintenance (e.g. including a Logger, custom Error objects, etc.).
Thought (1): You can manually retrieve the data to be merged (as in my example), but this means that if some day you add another relation to the Vendor model, say Vendor has_many :customers, and forget to add it to the VendorMerger, it will "fail silently", since VendorMerger is not aware of the new relation :customers. To solve this, you can dynamically grab all models having a reference to Vendor (where a column is vendor_id, OR the class_name option is equal to 'Vendor', OR the relation is polymorphic and the XX_type column holds a 'Vendor' value) and convert all those foreign keys from the old ID to the new one.
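In Rails itself you would lean on reflection (e.g. reflect_on_all_associations) to discover those models; the core idea can be sketched in plain Ruby, with PurchaseOrder and Tax as hypothetical stand-in classes:

```ruby
# Hypothetical stand-ins for models that carry a vendor_id foreign key.
class PurchaseOrder
  attr_accessor :vendor_id
  def initialize(vendor_id)
    @vendor_id = vendor_id
  end
end

class Tax
  attr_accessor :vendor_id
  def initialize(vendor_id)
    @vendor_id = vendor_id
  end
end

# Repoint every record referencing from_id to to_id, skipping anything
# that doesn't expose a vendor_id at all.
def merge_vendor_ids!(records, from_id:, to_id:)
  records.each do |record|
    next unless record.respond_to?(:vendor_id)
    record.vendor_id = to_id if record.vendor_id == from_id
  end
end

records = [PurchaseOrder.new(2), Tax.new(2), PurchaseOrder.new(1)]
merge_vendor_ids!(records, from_id: 2, to_id: 1)
records.map(&:vendor_id) # => [1, 1, 1]
```

Because the loop only asks each record whether it responds to vendor_id, newly added models are picked up automatically instead of failing silently.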
I am starting to learn rails and have run into a problem. I am writing a simple application (similar to the twitter tutorials I have seen) where a user logs in and creates a new post.
When a user logs in, I am setting the session information as follows
session[:id] = authorized_user.id
session[:email] = authorized_user.email
So now I have the ID of the user logged in. Upon login, the user is brought to a form where they can submit a new post (3 fields). When the user clicks submit, I want to create a new record with the data they entered and associate the record with that user (user ID). I am not exactly sure how to do this.
Below is the code on the controller:
def create
  # Used for creating new status posts
  # Need to get the ID of the user logged in
  @user = AdminUser.find(session[:id])
  # Instantiate new object using form parameters
  @post = Post.new(post_params)
  @post.AdminUser = @user # THIS IS THE LINE NOT WORKING
  # Save the object
  if @post.save
    # If save succeeds, redirect to the index action
    flash[:notice] = "Status has been saved"
    redirect_to(:action => 'index')
  else
    # If the save fails, redisplay the form so user can fix problems
    render('new')
  end
end
Here is the private method for post_params:
def post_params
  # Defining the params that are allowed to be passed with forms.
  params.require(:post).permit(:post_status, :post_title, :post_content)
end
The record is saved but the UserID for the record is NULL.
My first instinct was to try to pass the user ID as a post parameter, but I think this is a potential security risk, so I am trying to figure out an alternate way. I am sure it is something simple and I am just missing it.
Attributes
Firstly,
@post.AdminUser = @user # THIS IS THE LINE NOT WORKING
You should use snake_case for your attribute names (you're using CamelCase). Calling an attribute AdminUser has all sorts of potential issues which will arise down the line.
Call it admin_user or admin_id or something similar
--
Params
Secondly,
I want to create a new record with the data they entered, and
associate the record to that user (User ID)
If you're trying to save a "dependent" record for an object (for example, saving a post for a user), you'll have to assign the user_id value yourself and pass it through the params, like so:
# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  def create
    @post = Post.new(post_params)
    @post.save
  end

  private

  def post_params
    params.require(:post).permit(:title, :body).merge({user_id: authorized_user.id})
  end
end
When you create an element in your app, you're basically just taking data from the params hash and sending it to the model to save. This is done using the strong parameters functionality introduced in Rails 4:
def post_params
  params.require(:post).permit(:title, :body).merge({user_id: authorized_user.id})
end
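In plain-hash terms (a simplified sketch that ignores ActionController::Parameters), permit plus merge just whitelists the form data and then overlays the trusted server-side value, so a forged user_id from the client is discarded:

```ruby
# Simulated form submission; the client-sent user_id cannot be trusted.
form_params = { 'title' => 'Hello', 'body' => 'First post', 'user_id' => 999 }

# Keep only the whitelisted keys (like permit), then force the real
# user id from the session (like merge).
attributes = form_params.slice('title', 'body').merge('user_id' => 42)

attributes # => {"title"=>"Hello", "body"=>"First post", "user_id"=>42}
```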
As you can see from my example above, you basically need to send the user_id / admin_id / AdminUser value through to the model (so it can save).
You can also do this by setting the attribute as the example below:
# app/controllers/posts_controller.rb
def create
  @post = Post.new(post_params)
  @post.user_id = authorized_user.id
  @post.save
end

private

def post_params
  params.require(:post).permit(:title, :body, :user_id)
end
--
You should also look at the difference between authentication & authorization for a better definition of your logged-in user object :)
Rewrite the line like this, taking UserID as the column name in the posts table:
@post.UserID = @user.id
I'm doing some tests with Sinatra v1.4.4 and Active Record v4.0.2. I've created a database and a table named Company with MySQL Workbench. In the Company table there are two fields, lat and lng, of DECIMAL(10,8) and DECIMAL(11,8) type respectively. Without using migrations, I defined the Company model as follows:
class Company < ActiveRecord::Base
end
Everything works except that lat and lng are served as strings and not as floats/decimals. Is there any way to define the type in the above Company class definition? Here is the Sinatra route serving the JSON response:
get '/companies/:companyId' do |companyId|
  begin
    gotCompany = Company.find(companyId)
    [200, {'Content-Type' => 'application/json'}, [{code: 200, company: gotCompany.attributes, message: t.company.found}.to_json]]
  rescue
    [404, {'Content-Type' => 'application/json'}, [{code: 404, message: t.company.not_found}.to_json]]
  end
end
Active Record correctly recognizes them as decimal; for example, this code prints decimal for both columns:
Company.columns.each {|c| puts c.type}
Maybe it's the Active Record attributes method doing the typecast?
Thanks,
Luca
You can wrap the getter methods for those attributes and cast them:
class Company < ActiveRecord::Base
  def lat
    read_attribute(:lat).to_f
  end

  def lng
    read_attribute(:lng).to_f
  end
end
That will convert them to floats, e.g.:
"1.61803399".to_f
=> 1.61803399
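One caveat worth knowing: String#to_f never raises, so malformed values silently become 0.0. If float rounding matters for coordinates, BigDecimal is a stricter alternative (a quick plain-Ruby check):

```ruby
require 'bigdecimal'

"1.61803399".to_f    # => 1.61803399
"not a number".to_f  # => 0.0 (no error raised)

# BigDecimal keeps exact decimal precision and rejects malformed input:
begin
  BigDecimal("not a number")
rescue ArgumentError
  # raised, unlike to_f
end
```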
Edit:
Want a more declarative way? Just extend ActiveRecord::Base:
# config/initializers/ar_type_casting.rb
class ActiveRecord::Base
  def self.cast_attribute(attribute, type_cast)
    define_method attribute do
      val = read_attribute(attribute)
      val.respond_to?(type_cast) ? val.send(type_cast) : val
    end
  end
end
Then use it like this:
class Company < ActiveRecord::Base
  cast_attribute :lat, :to_f
  cast_attribute :lng, :to_f
end
Now when you call those methods on an instance they will be type casted to_f.
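The same define_method trick can be seen working outside Active Record. In this plain-Ruby sketch, Record and its @attributes hash are stand-ins for an AR model and its attribute storage:

```ruby
class Record
  # Define a reader that applies the given conversion, mirroring the
  # cast_attribute helper above.
  def self.cast_attribute(attribute, type_cast)
    define_method(attribute) do
      val = @attributes[attribute]
      val.respond_to?(type_cast) ? val.send(type_cast) : val
    end
  end

  cast_attribute :lat, :to_f
  cast_attribute :lng, :to_f

  def initialize(attributes)
    @attributes = attributes # stand-in for AR's attribute storage
  end
end

company = Record.new(lat: '48.85837', lng: '2.294481')
company.lat # => 48.85837 (a Float, not a String)
```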
Following diego.greyrobot's reply, I modified my Company class with an additional method. It overrides the attributes method and then typecasts the needed fields. Yet something more declarative would be desirable, IMHO.
class Company < ActiveRecord::Base
  def attributes
    ret_hash = super
    ret_hash['lat'] = self.lat.to_f
    ret_hash['lng'] = self.lng.to_f
    ret_hash
  end
end
I'm using both django-taggit and django-filter in my web application, which stores legal decisions. My main view (below) inherits from the stock django-filter FilterView and allows people to filter the decisions by both statutes and parts of statutes.
class DecisionListView(FilterView):
    context_object_name = "decision_list"
    filterset_class = DecisionFilter
    queryset = Decision.objects.select_related().all()

    def get_context_data(self, **kwargs):
        # Call the base implementation to get a context
        context = super(DecisionListView, self).get_context_data(**kwargs)
        # Add in querysets for all the statutes
        context['statutes'] = Statute.objects.select_related().all()
        context['tags'] = Decision.tags.most_common().distinct()
        return context
I also tag decisions by topic when they're added and I'd like people to be able to filter on that too. I currently have the following in models.py:
class Decision(models.Model):
    citation = models.CharField(max_length=100)
    decision_making_body = models.ForeignKey(DecisionMakingBody)
    statute = models.ForeignKey(Statute)
    paragraph = models.ForeignKey(Paragraph)
    ...
    tags = TaggableManager()

class DecisionFilter(django_filters.FilterSet):
    class Meta:
        model = Decision
        fields = ['statute', 'paragraph']
I tried adding 'tags' to the fields list in DecisionFilter but that had no effect, presumably because a TaggableManager is a Manager rather than a field in the database. I haven't found anything in the docs for either app that covers this. Is it possible to filter on taggit tags?
You should be able to use 'tags__name' as the search/filter field. Check out the Filtering section on http://django-taggit.readthedocs.org/en/latest/api.html#filtering
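Concretely (a sketch assuming the setup from the question; the `tags__name` lookup is taken from the taggit docs), the FilterSet would become:

```python
class DecisionFilter(django_filters.FilterSet):
    class Meta:
        model = Decision
        # 'tags__name' traverses the taggit relation to the Tag.name field
        fields = ['statute', 'paragraph', 'tags__name']
```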