I would appreciate a nudge in the right direction with this problem.
Below is a spider that:
1. crawls listing page and retrieves each record's summary info (10 rows/page)
2. follows each record's URL to extract detailed info from that record's page
3. goes to the next listing page
Problem: each record's detailed info extracts fine, but every record ends up with the summary info of the last record from the same listing page.
Simplified example:
URL DA Detail1 Detail2
9 9 0 0
9 9 1 1
9 9 2 2
9 9 3 3
9 9 4 4
9 9 5 5
9 9 6 6
9 9 7 7
9 9 8 8
9 9 9 9
With the scrapy shell, I can iterate through manually and get the correct values as shown below:
import scrapy
from cbury_scrapy.items import DA
for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
r = scrapy.Selector(text=row.extract(), type="html")
print r.xpath('//td[@class="datrack_danumber_cell"]//text()').extract_first(), r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()[-5:]
Output
SC-18/2016 HQQM=
DA-190/2016 HQwQ=
DA-192/2016 HQAk=
S68-122/2016 HQgM=
DA-191/2016 HQgc=
DA-223/2015/A HQQY=
DA-81/2016/A GSgY=
PCA-111/2016 GSwU=
PCD-101/2016 GSwM=
PCD-100/2016 GRAc=
When the spider is run, the last record's summary details repeat for each record on the same listing page. Please see the spider below; the offending code seems to be the first 10 lines of the parse method.
""" Run under bash with:
timenow=`date +%Y%m%d_%H%M%S`; scrapy runspider cbury_spider.py -o cbury-scrape-$timenow.csv
Problems? Interactively check Xpaths etc.:
scrapy shell "http://datrack.canterbury.nsw.gov.au/cgi/datrack.pl?search=search&sortfield=^metadata.date_lodged""""
import scrapy
from cbury_scrapy.items import DA
def td_text_after(label, response):
""" retrieves text from first td following a td containing a label e.g.:"""
return response.xpath("//*[contains(text(), '" + label + "')]/following-sibling::td//text()").extract_first()
class CburySpider(scrapy.Spider):
# scrapy.Spider attributes
name = "cbury"
allowed_domains = ["datrack.canterbury.nsw.gov.au"]
start_urls = ["http://datrack.canterbury.nsw.gov.au/cgi/datrack.pl?search=search&sortfield=^metadata.date_lodged",]
# required for unicode character replacement of '$' and ',' in est_cost
translation_table = dict.fromkeys(map(ord, '$,'), None)
da = DA()
da['lga'] = u"Canterbury"
def parse(self, response):
""" Retrieve DA no., URL and address for DA on summary list page """
for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
r = scrapy.Selector(text=row.extract(), type="html")
self.da['da_no'] = r.xpath('//td[@class="datrack_danumber_cell"]//text()').extract_first()
self.da['house_no'] = r.xpath('//td[@class="datrack_houseno_cell"]//text()').extract_first()
self.da['street'] = r.xpath('//td[@class="datrack_street_cell"]//text()').extract_first()
self.da['town'] = r.xpath('//td[@class="datrack_town_cell"]//text()').extract_first()
self.da['url'] = r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()
# then retrieve remaining DA details from the detail page
yield scrapy.Request(self.da['url'], callback=self.parse_da_page)
# follow next page link if one exists
next_page = response.xpath("//*[contains(text(), 'Next')]/@href").extract_first()
if next_page:
yield scrapy.Request(next_page, self.parse)
def parse_da_page(self, response):
""" Retrieve DA information from its detail page """
labels = { 'date_lodged': 'Date Lodged:', 'desc_full': 'Description:',
'est_cost': 'Estimated Cost:', 'status': 'Status:',
'date_determined': 'Date Determined:', 'decision': 'Decision:',
'officer': 'Responsible Officer:' }
# map DA fields with those in the following <td> elements on the page
for i in labels:
self.da[i] = td_text_after(labels[i], response)
# convert est_cost text to int for easier sheet import "12,000" -> 12000
if self.da['est_cost'] != None:
self.da['est_cost'] = int(self.da['est_cost'].translate(self.translation_table))
# Get people data from 'Names' table with 'Role' heading
self.da['names'] = []
for row in response.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
da_name = {}
da_name['role'] = row.xpath('normalize-space(./td[1])').extract_first()
da_name['name_no'] = row.xpath('normalize-space(./td[2])').extract_first()
da_name['full_name'] = row.xpath('normalize-space(./td[3])').extract_first()
self.da['names'].append(da_name)
yield self.da
Your help would be much appreciated.
Scrapy is asynchronous: once you've submitted a request, there's no guarantee when that request will be actioned. Because of this, your self.da is unreliable for passing data to parse_da_page. Instead, create da_items = DA() in your parse routine and pass it in the request as meta.
for row in response.xpath(...):
da_items = DA()
da_items['street'] = row.xpath(...)
...
da_items['url'] = row.xpath(...)
yield scrapy.Request(da_items['url'], callback=self.parse_da_page, meta=da_items)
Then in parse_da_page you can retrieve these values using response.meta['street'] etc. Have a look at the docs here.
Note also that your line r = scrapy.Selector(text=row.extract(), type="html") is redundant; you can simply use the variable row directly, as I've done in my example above.
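The underlying pitfall can be reproduced without Scrapy at all: a single shared object mutated in a loop ends up holding only the last row's values, whereas a fresh object per iteration keeps each row distinct. A minimal plain-Python sketch of both patterns (plain dicts standing in for the DA item):

```python
# A single shared dict mutated in a loop: every collected "item"
# references the same object, which holds only the last values.
shared = {}
collected_shared = []
for n in range(3):
    shared['da_no'] = n              # overwrites the one shared object
    collected_shared.append(shared)

# A fresh dict per iteration: each item keeps its own values.
collected_fresh = []
for n in range(3):
    item = {}                        # analogous to da_items = DA() per row
    item['da_no'] = n
    collected_fresh.append(item)

print([d['da_no'] for d in collected_shared])  # [2, 2, 2]
print([d['da_no'] for d in collected_fresh])   # [0, 1, 2]
```

This is exactly why the spider's class-level self.da repeats the last summary row: every pending request shares the one item.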
Related
I have a multiline string variable which includes multiple lines of a system log, like below, and I would like to extract the JSON part only
System 123456
Logs start 2021-07-03 12:00:00
<event> {log_in_json}
<event> {log_in_json}
I'm using find to search over the string variable, but this only allows me to get the first occurrence. Could anyone advise?
start = var.find('<event>')
end = var.find("}}")
extracted_line = var[start:end+len("}}")]
json_str = extracted_line.lstrip('<event>')
print(json_str)
Using the optional second argument to the find method, we can set the starting
point for the search. So, second and following times around, we'll start where
we previously found the last match (end), until the method returns -1:
var = '''
System 123456
Logs start 2021-07-03 12:00:00
<event> {log_in_json}
<event> {log_in_json2}
'''
start = var.find('<event>')
while start != -1:
end = var.find("}", start)
extracted_line = var[start:end+len("}")]
json_str = extracted_line.lstrip('<event> ')
print(json_str)
start = var.find('<event>', end)
# {log_in_json}
# {log_in_json2}
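For comparison, a regular expression collects all occurrences in a single pass; this sketch assumes each `<event>` payload is a one-line `{...}` block, as in the sample:

```python
import re

var = '''
System 123456
Logs start 2021-07-03 12:00:00
<event> {log_in_json}
<event> {log_in_json2}
'''

# capture everything from the opening brace after '<event>' up to
# the first closing brace, one match per event line
matches = re.findall(r'<event>\s*(\{.*?\})', var)
for json_str in matches:
    print(json_str)
# {log_in_json}
# {log_in_json2}
```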
I'm not sure where my error is, but I believe it's with the for loops. As it is, it prints values from the last record only. I tried to put the print below statusInput = profile_data['_statusinput'], inside the for loop, but that didn't work; I got:
IndentationError: unindent does not match any outer indentation level
Code
import json
with open('config/'+'config.json', 'r') as file:
data: list = json.load(file)
lista = data
for element in lista:
print("")
for alias_element in element:
#print("Alias: " +alias_element)
for result in element[alias_element]:
profile_data = result
aliasInput = profile_data['_aliasinput']
timesInput = profile_data['_timesinput']
idInput = profile_data['_idinput']
statusInput = profile_data['_statusinput']
print(f" Values from register are {aliasInput}{timesInput}{idInput}{statusInput}")
Result
Last record value only.
Example:
Values from register are test2 12:45 19:20 888888 true
Expected
Print values of all records on the screen. Also I'd like to add a condition that prints only if statusInput == true
Example:
Values from register are test 10:20 11111 true
Values from register are test1 11:50 99999 true
Values from register are test2 12:45 19:20 888888 true
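A minimal sketch of the nested iteration with the status condition added, assuming config.json is shaped the way the loops above imply (a list of objects mapping an alias to a list of profile dicts); the sample data here is made up for illustration:

```python
import json

# hypothetical sample data mirroring the structure the loops expect
data = json.loads('''
[
  {"alias1": [
    {"_aliasinput": "test",  "_timesinput": "10:20", "_idinput": "11111", "_statusinput": true},
    {"_aliasinput": "test1", "_timesinput": "11:50", "_idinput": "99999", "_statusinput": false}
  ]}
]
''')

lines = []
for element in data:
    for alias_element in element:
        for profile_data in element[alias_element]:
            # print only records whose status flag is true
            if profile_data['_statusinput']:
                lines.append("Values from register are "
                             f"{profile_data['_aliasinput']} "
                             f"{profile_data['_timesinput']} "
                             f"{profile_data['_idinput']} "
                             f"{profile_data['_statusinput']}")

for line in lines:
    print(line)
```

Keeping the print at the innermost indentation level (inside the record loop) is what makes every record appear rather than just the last one.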
I am currently doing a tweet search using the Twitter API. However, using the tweet id to page back is not working for me.
Here is my code:
searchQuery = '#BLM' # this is what we're searching for
searchQuery = searchQuery + "-filter:retweets"
Geocode="39.8, -95.583068847656, 2500km"
maxTweets = 1000000 # Some arbitrary large number
tweetsPerQry = 100 # this is the max the API permits
fName = 'tweetsBLM.json' # We'll store the tweets in a json file.
sinceId = None
#max_id = -1 # initial search
max_id=1278836959926980609 # the last id of previous search
tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
while tweetCount < maxTweets:
try:
if (max_id <= 0):
if (not sinceId):
new_tweets = api.search(q=searchQuery,lang="en", geocode=Geocode,
count=tweetsPerQry)
else:
new_tweets = api.search(q=searchQuery,lang="en",geocode=Geocode,
count=tweetsPerQry,
since_id=sinceId )
else:
if (not sinceId):
new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
count=tweetsPerQry,
max_id=str(max_id - 1) )
else:
new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
count=tweetsPerQry,
max_id=str(max_id - 1),
since_id=sinceId)
if not new_tweets:
print("No more tweets found")
break
for tweet in new_tweets:
f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
'\n')
tweetCount += len(new_tweets)
print("Downloaded {0} tweets".format(tweetCount))
max_id = new_tweets[-1].id
except tweepy.TweepError as e:
# Just exit if any error
print("some error : " + str(e))
print('exception raised, waiting 15 minutes')
print('(until:', dt.datetime.now() + dt.timedelta(minutes=15), ')')
time.sleep(15*60)
break
print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
This code works perfectly fine. I initially ran it and got about 40,000 tweets. Then I took the id of the last tweet of the previous/initial search to go back in time. However, I was disappointed to see that there were no tweets anymore. I can't believe that for a second; I must be going wrong somewhere, because #BLM has been very active in the last 2-3 months.
Any help is very welcome. Thank you
I may have found the answer. Using the standard Twitter search API, it is not possible to get older tweets (7 days old or more), and using max_id to get around this is not possible either.
The only way is to stream and wait for more than 7 days.
Finally, there is also this project that looks for older tweets:
https://pypi.org/project/GetOldTweets3/ - it is an extension of Jefferson Henrique's original work
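The max_id walk itself can be sketched independently of Twitter: each pass asks for results at or below the cursor, then moves the cursor to just below the oldest id returned, until a pass comes back empty. A plain-Python sketch with fake ids standing in for api.search:

```python
def search(ids, max_id=None, count=3):
    """Fake api.search: up to `count` ids <= max_id, newest first."""
    window = [i for i in sorted(ids, reverse=True)
              if max_id is None or i <= max_id]
    return window[:count]

all_ids = [101, 102, 103, 104, 105, 106, 107]
collected = []
max_id = None
while True:
    page = search(all_ids, max_id)
    if not page:
        break                     # no more tweets found
    collected.extend(page)
    max_id = page[-1] - 1         # step just below the oldest id seen

print(collected)  # [107, 106, 105, 104, 103, 102, 101]
```

The mechanism is sound; the question's problem is that the real search endpoint simply holds no data older than about 7 days, so the walk hits an empty page early.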
I have three DynamoDB tables. Two tables have instance IDs that are part of an application and the other is a master table of all instances across all of my accounts and the tag metadata. I have two scans for the two tables to get the instance IDs and then query the master table for the tag metadata. However, when I try writing this to the CSV file, I want to have two separate header sections for each dynamo table's unique output. Once the first iteration is done, the second file write writes to the last row where the first iteration left off instead of starting over at the top in the second header section. Below is my code and an output example to make it clear.
CODE:
import boto3
import csv
import json
from boto3.dynamodb.conditions import Key, Attr
dynamo = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')
s3 = boto3.resource('s3')
# Required resource and client calls
all_instances_table = dynamodb.Table('Master')
missing_response = dynamo.scan(TableName='T1')
installed_response = dynamo.scan(TableName='T2')
# Creates CSV DictWriter object and fieldnames
with open('file.csv', 'w') as csvfile:
fieldnames = ['Agent Not Installed', 'Not Installed Account', 'Not Installed Tags', 'Not Installed Environment', " ", 'Agent Installed', 'Installed Account', 'Installed Tags', 'Installed Environment']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
# Find instances IDs from the missing table in the master table to pull tag metadata
for instances in missing_response['Items']:
instance_missing = instances['missing_instances']['S']
#print("Missing:" + instance_missing)
query_missing = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_missing))
for item_missing in query_missing['Items']:
missing_id = item_missing['ID']
missing_account = item_missing['Account']
missing_tags = item_missing['Tags']
missing_env = item_missing['Environment']
# Write the data to the CSV file
writer.writerow({'Agent Not Installed': missing_id, 'Not Installed Account': missing_account, 'Not Installed Tags': missing_tags, 'Not Installed Environment': missing_env})
# Find instances IDs from the installed table in the master table to pull tag metadata
for instances in installed_response['Items']:
instance_installed = instances['installed_instances']['S']
#print("Installed:" + instance_installed)
query_installed = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_installed))
for item_installed in query_installed['Items']:
installed_id = item_installed['ID']
print(installed_id)
installed_account = item_installed['Account']
installed_tags = item_installed['Tags']
installed_env = item_installed['Environment']
# Write the data to the CSV file
writer.writerow({'Agent Installed': installed_id, 'Installed Account': installed_account, 'Installed Tags': installed_tags, 'Installed Environment': installed_env})
OUTPUT:
This is what the columns/rows look like in the file.
I need all of the output to be on the same line for each header section.
DATA:
Here is a sample of what both tables look like.
SAMPLE OUTPUT:
Here is what the for loops print out and appends to the lists.
Missing:
i-0xxxxxx 333333333 foo@bar.com int
i-0yyyyyy 333333333 foo1@bar.com int
Installed:
i-0zzzzzz 44444444 foo2@bar.com int
i-0aaaaaa 44444444 foo3@bar.com int
You want to collect related rows together into a single list to write on a single row, something like:
missing = [] # collection for missing_responses
installed = [] # collection for installed_responses
# Find instances IDs from the missing table in the master table to pull tag metadata
for instances in missing_response['Items']:
instance_missing = instances['missing_instances']['S']
#print("Missing:" + instance_missing)
query_missing = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_missing))
for item_missing in query_missing['Items']:
missing_id = item_missing['ID']
missing_account = item_missing['Account']
missing_tags = item_missing['Tags']
missing_env = item_missing['Environment']
# Update first half of row with missing list
missing.append([missing_id, missing_account, missing_tags, missing_env])
# Find instances IDs from the installed table in the master table to pull tag metadata
for instances in installed_response['Items']:
instance_installed = instances['installed_instances']['S']
#print("Installed:" + instance_installed)
query_installed = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_installed))
for item_installed in query_installed['Items']:
installed_id = item_installed['ID']
print(installed_id)
installed_account = item_installed['Account']
installed_tags = item_installed['Tags']
installed_env = item_installed['Environment']
# update second half of row by updating installed list
installed.append([installed_id, installed_account, installed_tags, installed_env])
# combine your two lists outside the loops, one concatenated row per pair
# note: list rows need a plain csv.writer rather than a DictWriter
for m, ins in zip(missing, installed):
# an empty spacer column between the two halves is optional
writer.writerow(m + ins)
This will work if your installed and missing tables operate on a relatable field - like a timestamp or an account ID, something that you can ensure keeps the rows being concatenated in the same order. A data sample would be useful to really answer the question.
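The combining step can be demonstrated with plain lists and csv.writer, using made-up field values shaped like the sample output above (the empty spacer column between the two halves is optional):

```python
import csv
import io

missing = [['i-0xxxxxx', '333333333', 'foo@bar.com', 'int'],
           ['i-0yyyyyy', '333333333', 'foo1@bar.com', 'int']]
installed = [['i-0zzzzzz', '44444444', 'foo2@bar.com', 'int'],
             ['i-0aaaaaa', '44444444', 'foo3@bar.com', 'int']]

buf = io.StringIO()
writer = csv.writer(buf)
# one CSV row per pair: missing half, spacer column, installed half
for m, ins in zip(missing, installed):
    writer.writerow(m + [''] + ins)

print(buf.getvalue())
```

zip pairs the rows positionally, which is why the two tables need a shared, stable ordering for the concatenation to be meaningful.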
I'm making word frequency tables with R and the preferred output format would be a JSON file. sth like
{
"word" : "dog",
"frequency" : 12
}
Is there any way to save the table directly into this format? I've been using the write.csv() function and convert the output into JSON but this is very complicated and time consuming.
set.seed(1)
( tbl <- table(round(runif(100, 1, 5))) )
## 1 2 3 4 5
## 9 24 30 23 14
library(rjson)
sink("json.txt")
cat(toJSON(tbl))
sink()
file.show("json.txt")
## {"1":9,"2":24,"3":30,"4":23,"5":14}
or even better:
set.seed(1)
( tab <- table(letters[round(runif(100, 1, 26))]) )
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 4 3 2 5 4 3 5 3 9 4 7 2 2 2 5 5 5 6 5 3 7 3 2 1
sink("lets.txt")
cat(toJSON(tab))
sink()
file.show("lets.txt")
## {"a":1,"b":2,"c":4,"d":3,"e":2,"f":5,"g":4,"h":3,"i":5,"j":3,"k":9,"l":4,"m":7,"n":2,"o":2,"p":2,"q":5,"r":5,"s":5,"t":6,"u":5,"v":3,"w":7,"x":3,"y":2,"z":1}
Then validate it with http://www.jsonlint.com/ to get pretty formatting. If you have multidimensional table, you'll have to work it out a bit...
EDIT:
Oh, now I see: you want the dataset characteristics sink-ed to a JSON file. No problem, just give us some sample data and I'll work on the code a bit. Practically, you need to coerce the data into the desired format, then convert it to JSON; a list should suffice. Give me a sec, I'll update my answer.
EDIT #2:
Well, time is relative... it's common knowledge... Here you go:
( dtf <- structure(list(word = structure(1:3, .Label = c("cat", "dog",
"mouse"), class = "factor"), frequency = c(12, 32, 18)), .Names = c("word",
"frequency"), row.names = c(NA, -3L), class = "data.frame") )
## word frequency
## 1 cat 12
## 2 dog 32
## 3 mouse 18
If dtf is a simple data frame, yes, a data.frame (and if it's not, coerce it!), then long story short, you can do:
toJSON(as.data.frame(t(dtf)))
## [1] "{\"V1\":{\"word\":\"cat\",\"frequency\":\"12\"},\"V2\":{\"word\":\"dog\",\"frequency\":\"32\"},\"V3\":{\"word\":\"mouse\",\"frequency\":\"18\"}}"
I thought I'd need some melt with this one, but a simple t did the trick. Now you only need to deal with column names after transposing the data.frame. t coerces data.frames to matrix, so you need to convert it back to data.frame. I used as.data.frame, but you can also use toJSON(data.frame(t(dtf))) - you'll get X instead of V as the variable name. Alternatively, you can use a regexp to clean the JSON file (if needed), but that's lousy practice; try to work it out by preparing the data.frame.
I hope this helped a bit...
These days I would typically use the jsonlite package.
library("jsonlite")
toJSON(mydatatable, pretty = TRUE)
This turns the data table into a JSON array of key/value pair objects directly.
RJSONIO is a package "that allows conversion to and from data in Javascript object notation (JSON) format". You can use it to export your object as a JSON file.
library(RJSONIO)
writeLines(toJSON(anobject), "afile.JSON")