I was trying out Luigi's multiprocessing capability by using the luigi.build method, but I'm getting a library error during execution:
    for next in self._add(item, is_complete):
  File "/home/manoj/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 604, in _add
    self._validate_dependency(d)
  File "/home/manoj/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 622, in _validate_dependency
    raise Exception('requires() must return Task objects')
Here is the piece of code I wrote to try to achieve my objective:
import luigi

class TaskOne(luigi.Task):
    custid = luigi.Parameter()

    def requires(self):
        pass

    def output(self):
        return luigi.LocalTarget("logs/" + str(self.custid) + "_success")

    def run(self):
        with self.output().open('w') as f:
            f.write("%s\n" % '')

class TaskTwo(luigi.Task):
    def requires(self):
        customersList = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
        yield luigi.build([TaskOne(custid=cust_id) for cust_id in customersList], workers=2)

    def output(self):
        return luigi.LocalTarget("logs/overall_success.txt")

    def run(self):
        with self.output().open('w') as f:
            f.write("%s\n" % "success")

if __name__ == '__main__':
    luigi.run()
========================================================================
Why do you think you need to call luigi.build() in requires()?
class TaskTwo(luigi.Task):
    def requires(self):
        customersList = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
        return [TaskOne(custid=cust_id) for cust_id in customersList]
If you want multiple workers, you can specify this at the command line when you start your pipeline.
luigi --module your_module TaskTwo --workers 2
requires() must return a luigi.Task object, or a list of luigi.Task objects. However, luigi.build() doesn't return anything. You don't need to call luigi.build to explicitly run the tasks, because Luigi handles running requirements on its own. The example task outlined in https://luigi.readthedocs.io/en/stable/tasks.html shows the basic paradigm of how it's supposed to work.
Also, you should omit requires() from TaskOne. If it has no dependencies, then there is no need to define it.
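Putting both fixes together, a minimal sketch of how the whole pipeline could look (same tasks as above; TaskOne loses its empty requires(), and TaskTwo returns plain task instances instead of calling luigi.build):

import luigi

class TaskOne(luigi.Task):
    custid = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("logs/" + str(self.custid) + "_success")

    def run(self):
        # touch the per-customer marker file
        with self.output().open('w') as f:
            f.write("%s\n" % '')

class TaskTwo(luigi.Task):
    def requires(self):
        # a plain list of Task instances -- no luigi.build() here
        customersList = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
        return [TaskOne(custid=cust_id) for cust_id in customersList]

    def output(self):
        return luigi.LocalTarget("logs/overall_success.txt")

    def run(self):
        with self.output().open('w') as f:
            f.write("%s\n" % "success")

if __name__ == '__main__':
    luigi.run()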
I was trying to create a Generative Adversarial Network using PyTorch. I coded the discriminator block and printed its summary. After that, I moved on to creating the generator block. I defined the forward() function and reshaped the input noise from (batch_size, noise_dim) to (batch_size, channel, height, width). While running the code to get the summary, an error popped up saying 'NoneType' object is not callable. I searched Stack Overflow and other places, but my problem wasn't resolved.
I first created a generator block function with the following code:
def get_gen_block(in_channels, out_channels, kernel_size, stride, final_block=False):
    if final_block == True:
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride),
            nn.Tanh()
        )
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride),
            nn.BatchNorm2d(out_channels),
            nn.ReLU()
        )
Then I defined a class for the generator that calls this block several times, and defined the forward() function like this:
class Generator(nn.Module):
    def __init__(self, noise_dim):
        super(Generator, self).__init__()
        self.noise_dim = noise_dim
        self.block_1 = get_gen_block(noise_dim, 256, (3, 3), 2)
        self.block_2 = get_gen_block(256, 128, (4, 4), 1)
        self.block_3 = get_gen_block(128, 64, (3, 3), 2)
        self.block_4 = get_gen_block(64, 1, (4, 4), 2, final_block=True)

    def forward(self, r_noise_vec):
        x = r_noise_vec.view(-1, self.noise_dim, 1, 1)
        x1 = self.block_1(x)
        x2 = self.block_2(x1)
        x3 = self.block_3(x2)
        x4 = self.block_4(x3)
        return x4
After this, when I printed the summary for the generator, this error occurred, pointing to the line 'x1 = self.block_1(x)' and saying 'NoneType' object is not callable. Can anyone please help me resolve this issue?
Please check your get_gen_block function: it looks like you missed an else: branch or messed up the indentation, so when final_block = False it returns None instead of
return nn.Sequential(
    nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride),
    nn.BatchNorm2d(out_channels),
    nn.ReLU()
)
if cond:
    return module1
    return module2

Always returns module1 when the condition is met, otherwise None.
I think you wanted this:

if cond:
    return module1
return module2

When the condition is met, it returns module1; otherwise module2. Now compare the indentation.
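Applied to the function from the question, a fixed version with an explicit else branch (same signature as above) might look like this:

import torch.nn as nn

def get_gen_block(in_channels, out_channels, kernel_size, stride, final_block=False):
    if final_block:
        # final layer: no batch norm, Tanh squashes output into [-1, 1]
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride),
            nn.Tanh()
        )
    else:
        # intermediate layers: batch norm followed by ReLU
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride),
            nn.BatchNorm2d(out_channels),
            nn.ReLU()
        )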
I received the error "too many values to unpack (expected 2)" when running the code below. Can anyone help me? I have added more details.
import gensim
import gensim.corpora as corpora

dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50, per_word_topics=True, eval_every=1)
print(ldamodel.print_topics(num_topics=3, num_words=20))

for i in range(0, 46):
    for index, score in sorted(ldamodel[doc_term_matrix[i]], key=lambda tup: -1*tup[1]):
        print("subject", i)
        print("\n")
        print("Score: {}\t \nTopic: {}".format(score, ldamodel.print_topic(index, 6)))
Let's focus on the loop, since this is where the error is being raised, and take it one iteration at a time.
>>> import numpy as np  # just so we can use np.shape()
>>> i = 0  # value in first loop
>>> x = sorted(ldamodel[doc_term_matrix[i]], key=lambda tup: -1*tup[1])
>>> np.shape(x)
(3, 3, 2)
>>> for index, score in x:
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 2)
Here is where your error is coming from. You are expecting each element of this result to unpack into 2 values, but it is a multislice structure with no simple, inferable way to unpack it. I do not personally have enough experience with this subject material to infer what you mean to be doing; I can only show you where your problem is coming from. Hope this helps!
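For what it's worth, a likely culprit is per_word_topics=True: with that flag set, indexing the model returns a 3-tuple (document topics, word topics, phi values) rather than a flat list of (topic, probability) pairs, which may explain the (3, 3, 2) shape above. A sketch of how the loop could be adapted under that assumption:

for i in range(0, 46):
    # with per_word_topics=True, ldamodel[bow] returns three parallel results;
    # only the first holds the (topic_id, probability) pairs the loop expects
    doc_topics, word_topics, phi_values = ldamodel[doc_term_matrix[i]]
    for index, score in sorted(doc_topics, key=lambda tup: -1 * tup[1]):
        print("subject", i)
        print("Score: {}\t \nTopic: {}".format(score, ldamodel.print_topic(index, 6)))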
I am writing a unittest that queries the trello board API and want to assert that a particular card exists.
The first attempt was using the /1/boards/[board_id]/lists request, which gives results like:
[{'cards': [
    {'id': 'id1', 'name': 'item1'},
    {'id': 'id2', 'name': 'item2'},
    {'id': 'id3', 'name': 'item3'},
    {'id': 'id4', 'name': 'item4'},
    {'id': 'id5', 'name': 'item5'},
    {'id': 'id6', 'name': 'item6'}],
  'id': 'id7',
  'name': 'ABC'},
 {'cards': [], 'id': 'id8', 'name': 'DEF'},
 {'cards': [], 'id': 'id9', 'name': 'GHI'}]
I want to assert that 'item6' is indeed in the above list. I load the JSON and use assertTrue, like this:
element = [item for item in json_data if item['name'] == "item6"]
self.assertTrue(element)
but I receive an error: TypeError: the JSON object must be str, bytes or bytearray, not 'list'.
Then I discovered that the /1/boards/[board_id]/cards request gives a plain list of cards:
[
{'id': 'id1', 'name': 'item1'},
{'id': 'id2', 'name': 'item2'},
...
]
How should I write this unittest assertion?
The neatest option is to create a class that will equal the dict for the card you want to ensure is there, then use that in an assertion. For your example, with a list of cards returned over the api:
cards = board.get_cards()
self.assertIn(Card(name="item6"), cards)
Here's a reasonable implementation for the Card() helper class; it may look a little complex, but it is mostly straightforward:
class Card(object):
    """Class that matches a dict with card details from json api response."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, dict):
            return other.get("name", None) == self.name
        return NotImplemented

    def __repr__(self):
        return "{}(name={!r})".format(self.__class__.__name__, self.name)
You could add more fields to validate as needed.
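For instance, a hypothetical variant that compares every field it is given (the id key is an assumption based on the JSON shown above):

class Card(object):
    """Matches a card dict from the json api response on every given field."""

    def __init__(self, **fields):
        # e.g. Card(name="item6", id="id6"); keys mirror the api response
        self.fields = fields

    def __eq__(self, other):
        if isinstance(other, dict):
            return all(other.get(k) == v for k, v in self.fields.items())
        return NotImplemented

    def __repr__(self):
        return "{}({})".format(
            self.__class__.__name__,
            ", ".join("{}={!r}".format(k, v) for k, v in self.fields.items()))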
One question worth touching on at this point is whether the unit test should be making real api queries. Generally a unit test would have test data to just focus on the function you control, but perhaps this is really an integration test for your trello deployment using the unittest module?
import unittest
from urllib.request import urlopen
import json

class Basic(unittest.TestCase):
    url = 'https://api.trello.com/1/boards/[my_id]/cards?fields=id,name,idList,url&key=[my_key]&token=[my_token]'
    response = urlopen(url)
    resp = response.read()
    json_ob = json.loads(resp)
    el_list = [item for item in json_ob if item['name'] == 'card6']

    def testBasic(self):
        self.assertTrue(self.el_list)

if __name__ == '__main__':
    unittest.main()
So, what I did wrong: I focused too much on the list itself, which I got after using the following code:
import requests
from pprint import pprint
import json
url = "https://api.trello.com/1/boards/[my_id]/lists"
params = {"cards":"open","card_fields":"name","fields":"name","key":"[my_key]","token":"[my_token]"}
response = requests.get(url=url, params=params)
pprint(response.json())
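For completeness, a sketch of how the /cards request and the Card helper from the answer could meet in a single test (requests-based, keeping the bracketed placeholders; the TrelloCards and test_card_exists names are made up here):

import unittest
import requests

class TrelloCards(unittest.TestCase):
    def test_card_exists(self):
        # URL placeholders kept as in the question
        url = "https://api.trello.com/1/boards/[my_id]/cards"
        params = {"fields": "id,name", "key": "[my_key]", "token": "[my_token]"}
        cards = requests.get(url, params=params).json()
        # Card.__eq__ compares against each dict in the response
        self.assertIn(Card(name="item6"), cards)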
I'm trying to build a small app for a university project with Scrapy.
The spider is scraping the items, but my pipeline is not inserting the data into the MySQL database. To test whether the pipeline or the pymysql implementation was at fault, I wrote a test script:
Code Start
#!/usr/bin/python3
import pymysql
str1 = "hey"
str2 = "there"
str3 = "little"
str4 = "script"
db = pymysql.connect("localhost", "root", "**********", "stromtarife")
cursor = db.cursor()
cursor.execute("SELECT * FROM vattenfall")
cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (str1, str2, str3, str4))
cursor.execute("SELECT * FROM vattenfall")
data = cursor.fetchone()
print(data)
db.commit()
cursor.close()
db.close()
Code End
After I run this script, my database has a new record, so it's not my pymysql.connect() function that is broken.
I'll provide my scrapy code:
vattenfall_form.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from stromtarife.items import StromtarifeItem
from scrapy.http import FormRequest

class VattenfallEasy24KemptenV1500Spider(scrapy.Spider):
    name = 'vattenfall-easy24-v1500-p87435'

    def start_requests(self):
        return [
            FormRequest(
                "https://www.vattenfall.de/de/stromtarife.htm",
                formdata={"place": "87435", "zipCode": "87435", "cityName": "Kempten",
                          "electricity_consumptionprivate": "1500", "street": "", "hno": ""},
                callback=self.parse
            ),
        ]

    def parse(self, response):
        item = StromtarifeItem()
        item['jahrespreis'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[3]/td[2]/text()').extract_first()
        item['treuebonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[2]/td/strong/text()').extract_first()
        item['sofortbonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[1]/td/strong/text()').extract_first()
        item['tarif'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[1]/h2/span/text()').extract_first()
        yield item

class VattenfallEasy24KemptenV2500Spider(scrapy.Spider):
    name = 'vattenfall-easy24-v2500-p87435'

    def start_requests(self):
        return [
            FormRequest(
                "https://www.vattenfall.de/de/stromtarife.htm",
                formdata={"place": "87435", "zipCode": "87435", "cityName": "Kempten",
                          "electricity_consumptionprivate": "2500", "street": "", "hno": ""},
                callback=self.parse
            ),
        ]

    def parse(self, response):
        item = StromtarifeItem()
        item['jahrespreis'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[3]/td[2]/text()').extract_first()
        item['treuebonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[2]/td/strong/text()').extract_first()
        item['sofortbonus'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[2]/form[1]/div/div[2]/table/tbody/tr[1]/td/strong/text()').extract_first()
        item['tarif'] = response.xpath('/html/body/main/div[1]/div[2]/div/div[3]/div[2]/div/div[1]/h2/span/text()').extract_first()
        yield item

process = CrawlerProcess()
process.crawl(VattenfallEasy24KemptenV1500Spider)
process.crawl(VattenfallEasy24KemptenV2500Spider)
process.start()
pipelines.py
import pymysql
from stromtarife.items import StromtarifeItem

class StromtarifePipeline(object):
    def __init__(self):
        self.connection = pymysql.connect("localhost", "root", "**********", "stromtarife")
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']))
        self.connection.commit()
        self.cursor.close()
        self.connection.close()
settings.py (I changed only this line):
ITEM_PIPELINES = {
    'stromtarife.pipelines.StromtarifePipeline': 300,
}
So, what is wrong with my code? I couldn't figure it out and would be really happy if someone can spot something I'm missing. Thanks in advance!
You should not close your pymysql connection every time you process an item.
You should write the close_spider function in your pipeline like this, so the connection is closed just once, at the end of the execution:
def close_spider(self, spider):
    self.cursor.close()
    self.connection.close()
Moreover, you need to return your item at the end of process_item.
Your pipelines.py file should look like this:
import pymysql
from stromtarife.items import StromtarifeItem

class StromtarifePipeline(object):
    def __init__(self):
        self.connection = pymysql.connect("localhost", "root", "**********", "stromtarife")
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)", (item['tarif'], item['sofortbonus'], item['treuebonus'], item['jahrespreis']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
UPDATE:
I tried your code; the problem is in the pipeline. There are two issues:
You try to insert the euro symbol € and I think MySQL does not like it.
Your query string is not well built.
I managed to get things done by writing the pipeline like this:
def process_item(self, item, spider):
    query = """INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) VALUES (%s, %s, %s, %s)""" % ("1", "2", "3", "4")
    self.cursor.execute(query)
    self.connection.commit()
    return item
I think you should remove the € from the prices you try to insert.
Hope this helps, let me know.
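Building on that, a hedged sketch of how the prices could be cleaned before a parameterized insert (the replace('€', '') call is an assumption about what the scraped strings contain):

def process_item(self, item, spider):
    # strip the euro sign and stray whitespace before inserting;
    # a parameterized execute() avoids building the query string by hand
    values = tuple(
        (item[key] or '').replace('€', '').strip()
        for key in ('tarif', 'sofortbonus', 'treuebonus', 'jahrespreis')
    )
    self.cursor.execute(
        "INSERT INTO vattenfall (tarif, sofortbonus, treuebonus, jahrespreis) "
        "VALUES (%s, %s, %s, %s)", values)
    self.connection.commit()
    return item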
There is another problem with your scraper besides the fact that your SQL Pipeline closes the SQL connection after writing the first item (as Adrien pointed out).
The other problem is that your scraper only scrapes one single item per results page (and also visits only one results page). I checked Vattenfall: there are usually multiple results displayed, and I guess you want to scrape them all.
That means you'll also have to iterate over the results on the page and create multiple items while doing so, as sketched below. The Scrapy tutorial gives a good explanation of how to do this: https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-quotes-and-authors
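In the spirit of that tutorial, here is a sketch of what an iterating parse() might look like; the div.result selector and the relative XPaths inside the loop are hypothetical stand-ins for whatever the Vattenfall results page actually uses:

def parse(self, response):
    # one item per result block instead of one item per page;
    # 'div.result' is a made-up selector -- inspect the real page to find it
    for result in response.css('div.result'):
        item = StromtarifeItem()
        item['tarif'] = result.xpath('.//h2/span/text()').extract_first()
        item['sofortbonus'] = result.xpath('.//tr[1]/td/strong/text()').extract_first()
        item['treuebonus'] = result.xpath('.//tr[2]/td/strong/text()').extract_first()
        item['jahrespreis'] = result.xpath('.//tr[3]/td[2]/text()').extract_first()
        yield item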
First of all, in Code Start, print(data) must come after db.commit(); otherwise the data that was just inserted into your database will not show up in the print.
Lastly, judging by the names of your columns, it's probably an encoding issue if the idea above doesn't work.
I have 3 very large files (+100 MB): file_hash, cert_hash and url_data, each with one string per line. The problem is that the amount of data in these files is not the same. I have used the izip_longest function to read all these files at once (I can't load them into memory), but I want to iterate for the length of the longest file (file_hash is longest): once all the data from cert_hash has been read, it should start taking values from the beginning of the cert_hash file again, and similarly, if url_data runs out, it should also start reading from its beginning. I have tried using the fillvalue parameter, but it takes only one value, and I want to give different values for cert_hash and url_data when they run out.
You should cycle cert_hash and url_data if you want them to restart. For example:
>>> from itertools import cycle, izip
>>> for t in izip("abcdef", cycle("ghi"), cycle("jklm")):
...     print t
...
('a', 'g', 'j')
('b', 'h', 'k')
('c', 'i', 'l')
('d', 'g', 'm')
('e', 'h', 'j')
('f', 'i', 'k')
Note that you no longer use izip_longest, as cycle is infinite.
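One caveat when applying this to the files themselves: itertools.cycle caches every element it sees, which works against the "can't load these files in memory" constraint. Here is a sketch of a cycle-like generator that re-reads a file from disk on each pass instead (filenames taken from the question):

from itertools import izip  # Python 2, matching the question

def cycle_file(path):
    """Yield lines forever, re-opening the file at EOF instead of caching."""
    while True:
        with open(path) as f:
            for line in f:
                yield line

with open("file_hash") as longest:
    # izip stops when the longest (finite) file is exhausted;
    # the two cycled files wrap around as often as needed
    for fh, ch, ud in izip(longest, cycle_file("cert_hash"), cycle_file("url_data")):
        pass  # process one triple of lines here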
If you want to restart at the end rather than the start, here is a tweak to the cycle equivalent implementation that achieves that:
>>> def zigzag(iterable):
...     """zigzag('ABCD') --> A B C D C B A B C D ..."""
...     forward = []
...     for element in iterable:
...         yield element
...         forward.append(element)
...     backward = forward[-2:0:-1]
...     while True:
...         for element in backward:
...             yield element
...         for element in forward:
...             yield element
...
>>> z = zigzag("ABCD")
>>> for _ in range(10):
...     print next(z)
...
A
B
C
D
C
B
A
B
C
D