Use XMLFeedSpider to parse HTML and XML

I have a webpage from which I take the RSS links. The links point to XML feeds, and I would like to use the XMLFeedSpider functionality to simplify the parsing.
Is that possible?
This would be the flow:
GET example.com/rss (returns HTML)
Parse the HTML and extract the RSS links
For each link, fetch and parse the XML

I found a simple way based on the existing example in the documentation and a look at the source code. Here is my solution:
import scrapy
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    # start_urls is unnecessary here, since start_requests() below supersedes it
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def start_requests(self):
        # First fetch the HTML page that lists the feed URLs
        urls = ['http://www.example.com/get-feed-links']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        # Extract the XML feed links and hand each one to XMLFeedSpider's parse()
        for el in response.css("li.feed-links"):
            yield scrapy.Request(el.css("a::attr(href)").extract_first(),
                                 callback=self.parse)

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item


Getting AttributeError: 'str' object has no attribute 'get'

I am getting an error while working with a JSON response:
AttributeError: 'str' object has no attribute 'get'
What could be the issue?
I am also getting the following errors for the rest of the values:
TypeError: 'builtin_function_or_method' object is not subscriptable
'Phone': value['_source']['primaryPhone'],
KeyError: 'primaryPhone'
# -*- coding: utf-8 -*-
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['hits']['hits']
        for value in values:
            yield {
                'Full Name': value['_source']['fullName'],
                'Phone': value['_source']['primaryPhone'],
                "Email": value['_source']['primaryEmail'],
                "City": value.get['_source']['city'],
                "Zip Code": value.get['_source']['zipcode'],
                "Website": value['_source']['websiteURL'],
                "Facebook": value['_source']['facebookURL'],
                "LinkedIn": value['_source']['LinkedIn_URL'],
                "Twitter": value['_source']['Twitter'],
                "BIO": value['_source']['Bio']
            }
The data is nested deeper than you think it is. That's why you're getting an error.
Code Example
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['hits']['hits']
        for value in values:
            yield {
                'Full Name': value['_source']['fullName'],
                'Primary Phone': value['_source']['primaryPhone']
            }
Explanation
The resp variable holds a Python dictionary, but there is no resp['hits']['hits']['fullName'] within this JSON data. The data you're looking for, for fullName, is actually resp['hits']['hits'][i]['_source']['fullName'], where i is a number, because resp['hits']['hits'] is a list.
resp['hits'] is a dictionary, so the values variable is fine.
But resp['hits']['hits'] is a list, so you can't call .get() on it, and it only accepts numbers within [], not strings. Hence the error.
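Schematically, the shape of the response looks like this (values elided, for illustration only):

resp = {
    "hits": {
        "hits": [                                   # a list: must be indexed by integer
            {"_source": {"fullName": "...",
                         "primaryPhone": "..."}},   # one dictionary per result
        ]
    }
}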
Tips
Use response.json() instead of json.loads(response.body): since Scrapy v2.2, Scrapy has support for JSON built in. Behind the scenes it already imports json.
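For example, the parse() method above could simply start with (a minimal sketch, assuming Scrapy 2.2+):

    def parse(self, response):
        resp = response.json()  # replaces json.loads(response.body)
        values = resp['hits']['hits']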
Also inspect the JSON data itself; I used the requests library for convenience and just worked down the nesting until I got to the data you needed.
Yielding a dictionary is fine for this type of data because it's well structured, but for any other data that needs modifying or changing, or is wrong in places, use either an Item or an ItemLoader (a sketch of the latter follows below). There's a lot more flexibility in those two ways of yielding output than in yielding a dictionary. I almost never yield a dictionary; the only time I do is when the data is highly structured.
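As a rough sketch of the ItemLoader route (assuming the TestItem class defined later in this answer; processors for cleaning values would be configured on the item fields):

from scrapy.loader import ItemLoader
from ..items import TestItem

    def parse(self, response):
        for value in response.json()['hits']['hits']:
            loader = ItemLoader(item=TestItem())
            # add_value() feeds raw data through any input/output processors
            loader.add_value('full_name', value['_source'].get('fullName'))
            loader.add_value('Phone', value['_source'].get('primaryPhone'))
            yield loader.load_item()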
Updated Code
Looking at the JSON data, there is quite a lot of missing data. This is part of web scraping; you will find gaps like this. Here we use a try/except block for when we get a KeyError, which means Python hasn't been able to find the key associated with a value. We have to handle that exception, which we do here by assigning a placeholder string like 'No XXX'.
Once you start getting gaps like this, it's better to consider an Item or an ItemLoader.
Now it's worth looking at the Scrapy docs about Items. Essentially, Scrapy does two things: it extracts data from websites, and it provides a mechanism for storing this data. The way it does this is with a dictionary-like container called an Item. The code isn't much different from yielding a dictionary, but an Item lets you manipulate the extracted data more easily with the extra things Scrapy can do. You need to edit your items.py first with the fields you want. We create a class called TestItem and define each field using scrapy.Field(). We can then import this class in our spider script.
items.py
import scrapy

class TestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    full_name = scrapy.Field()
    Phone = scrapy.Field()
    Email = scrapy.Field()
    City = scrapy.Field()
    Zip_code = scrapy.Field()
    Website = scrapy.Field()
    Facebook = scrapy.Field()
    Linkedin = scrapy.Field()
    Twitter = scrapy.Field()
    Bio = scrapy.Field()
Here we're specifying what we want the fields to be. Unfortunately you can't use a name with spaces, which is why full name becomes full_name. scrapy.Field() creates the field of the item for us.
We import this item into our spider script with from ..items import TestItem. The from ..items means we're taking items.py from the parent folder of the spider script and importing the class TestItem. That way our spider can populate the item with our JSON data.
Note that at the top of the for loop we instantiate the class TestItem with item = TestItem(). Instantiating means calling the class; in this case it makes a dictionary-like item. We create a fresh item for each result and then populate it with keys and values. You have to do this before you add your keys and values, as you can see within the for loop.
Spider script
import scrapy
from ..items import TestItem

class MainSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        values = response.json()['hits']['hits']
        for value in values:
            item = TestItem()  # a fresh item for each result
            try:
                item['full_name'] = value['_source']['fullName']
            except KeyError:
                item['full_name'] = 'No Name'
            try:
                item['Phone'] = value['_source']['primaryPhone']
            except KeyError:
                item['Phone'] = 'No Phone number'
            try:
                item['Email'] = value['_source']['primaryEmail']
            except KeyError:
                item['Email'] = 'No Email'
            try:
                item['City'] = value['_source']['activeLocations'][0]['city']
            except KeyError:
                item['City'] = 'No City'
            try:
                item['Zip_code'] = value['_source']['activeLocations'][0]['zipcode']
            except KeyError:
                item['Zip_code'] = 'No Zip code'
            try:
                item['Website'] = value['_source']['AgentMarketingCenter'][0]['Website']
            except KeyError:
                item['Website'] = 'No Website'
            try:
                item['Facebook'] = value['_source']['AgentMarketingCenter'][0]['Facebook_URL']
            except KeyError:
                item['Facebook'] = 'No Facebook'
            try:
                item['Linkedin'] = value['_source']['AgentMarketingCenter'][0]['LinkedIn_URL']
            except KeyError:
                item['Linkedin'] = 'No Linkedin'
            try:
                item['Twitter'] = value['_source']['AgentMarketingCenter'][0]['Twitter']
            except KeyError:
                item['Twitter'] = 'No Twitter'
            try:
                item['Bio'] = value['_source']['AgentMarketingCenter'][0]['Bio']
            except KeyError:
                item['Bio'] = 'No Bio'
            yield item
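Alternatively, the repeated try/except blocks can be collapsed with a small helper that walks a path of keys and indices; a minimal sketch (the helper name pick is illustrative, not part of any library):

def pick(source, path, default):
    # Walk a sequence of keys/indices, returning default if any step is missing
    try:
        for key in path:
            source = source[key]
        return source
    except (KeyError, IndexError, TypeError):
        return default

# usage inside the for loop:
# item['City'] = pick(value, ['_source', 'activeLocations', 0, 'city'], 'No City')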

How to export entire body of HTML to nested JSON

I am using Python3 and Scrapy.
I have a simple spider (shown below) with which I want to save the response.url and the response.text as items. I would like to save the response.text as JSON (which I view in Notepad++). Is there any way it can be saved with a nested structure, such as the one that appears in the native HTML of the page?
class Spider1(scrapy.Spider):
    name = "Spider1"
    allowed_domains = []
    start_urls = ['http://www.uam.es/']

    def parse(self, response):
        items = Spider1Item()
        items['url'] = response.url
        items['body'] = response.text
        yield items
EDIT:
Here is a snippet of the target structure I would like when I export to JSON.
[HTML snippet of target structure, shown as an image in the original post]
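As a side note, Scrapy's built-in feed exports will write whatever the spider yields to a JSON file, e.g.:

scrapy crawl Spider1 -o output.json

This produces a JSON array with one object per item, each holding the url and body keys; the body value is, however, a single HTML string rather than a nested structure.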

BeautifulSoup gets nothing between tags

I am a novice at writing web crawlers. I want to use the search engine at http://www.creditchina.gov.cn/search_all#keyword=&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1 to check whether my input is valid.
For example, 912101127157655762 is a valid input, and 912101127157655760 is invalid.
After inspecting the page source in the developer tools, I found that if the input is an invalid number, the tags look like this (screenshot not included in the text):
while if the input is valid, they look like this (screenshot not included in the text):
So I want to determine whether the input is valid by checking whether there is anything inside the <ul class="credit-info-results public-results-left item-template"> tag. Here is how I wrote my crawler:
import urllib.request
from bs4 import BeautifulSoup

url = ('http://www.creditchina.gov.cn/search_all#keyword=912101127157655762&searchtype=0&'
       'templateId=&creditType=&areas=&objectType=2&page=1')
req = urllib.request.Request(url)
data = urllib.request.urlopen(req)
bs = data.read().decode('utf-8')
soup = BeautifulSoup(bs, 'lxml')
check = soup.find_all("ul", {"class": "credit-info-results public-results-left item-template"})
if check == []:
    pass  # TODO: handle invalid input
else:
    pass  # TODO: handle valid input
However, the value of check is always []. I cannot understand why there is nothing between the tags. I hope somebody can help me solve the problem.
You're getting not HTML but a JS object as the response. That's why BeautifulSoup can't parse it.
You can use a substring search to check whether the response contains something or not.
import urllib.request
from bs4 import BeautifulSoup

url = ('http://www.creditchina.gov.cn/search_all#keyword=912101127157655762&searchtype=0&'
       'templateId=&creditType=&areas=&objectType=2&page=1')
req = urllib.request.Request(url)
data = urllib.request.urlopen(req)
bs = data.read().decode('utf-8')
# str.find() returns -1 when the substring is absent
ul_pos = bs.find('credit-info-results public-results-left item-template')
if ul_pos != -1:
    bs = bs[ul_pos:]
soup = BeautifulSoup(bs, 'lxml')
check = soup.find_all("ul", {"class": "credit-info-results public-results-left item-template"})
if check == []:
    pass  # TODO: handle invalid input
else:
    pass  # TODO: handle valid input

Not able to store data scraped with scrapy in json or csv format

I want to store the data from a list given on a website page. If I run the commands
response.css('title::text').extract_first() and
response.css("article div#section-2 li::text").extract()
individually in the Scrapy shell, they show the expected output.
Below is my code, which is not storing the data in JSON or CSV format:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "medical"
    start_urls = ['https://medlineplus.gov/ency/article/000178.html/']

    def parse(self, response):
        yield
        {
            'topic': response.css('title::text').extract_first(),
            'symptoms': response.css("article div#section-2 li::text").extract()
        }
I have tried to run this code using
scrapy crawl medical -o medical.json
You need to fix your URL: it is https://medlineplus.gov/ency/article/000178.htm and not https://medlineplus.gov/ency/article/000178.html/.
Note also that a bare yield on its own line yields None; the dictionary on the following line is evaluated as a separate statement and discarded, so nothing ever reaches the exporter. The yield and its value must be a single expression.
Also, and more importantly, you need to define an Item class and yield/return it from the parse() callback of your spider:
import scrapy

class MyItem(scrapy.Item):
    topic = scrapy.Field()
    symptoms = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "medical"
    allowed_domains = ['medlineplus.gov']
    start_urls = ['https://medlineplus.gov/ency/article/000178.htm']

    def parse(self, response):
        item = MyItem()
        item["topic"] = response.css('title::text').extract_first()
        item["symptoms"] = response.css("article div#section-2 li::text").extract()
        yield item

Using Django Rest Framework, how can I upload a file AND send a JSON payload?

I am trying to write a Django Rest Framework API handler that can receive a file as well as a JSON payload. I've set MultiPartParser as the handler's parser.
However, it seems I cannot do both. If I send the payload with the file as a multipart request, the JSON payload arrives mangled in request.data (the first text part up to the first colon becomes the key, and the rest is the data). I can send the parameters as standard form parameters just fine, but the rest of my API accepts JSON payloads and I wanted to be consistent. request.body cannot be read, as it raises RawPostDataException: You cannot access body after reading from request's data stream.
For example, a file and this payload in the request body:
{"title":"Document Title", "description":"Doc Description"}
Becomes:
<QueryDict: {u'fileUpload': [<InMemoryUploadedFile: 20150504_115355.jpg (image/jpeg)>, <InMemoryUploadedFile: Front end lead.doc (application/msword)>], u'{%22title%22': [u'"Document Title", "description":"Doc Description"}']}>
Is there a way to do this? Can I eat my cake, keep it and not gain any weight?
Edit:
It was suggested that this might be a duplicate of Django REST Framework upload image: "The submitted data was not a file". It is not. The request is done as multipart, and keep in mind that the file and its upload are fine. I can even complete the request with standard form variables. But I want to see if I can get a JSON payload in there instead.
For someone who needs to upload a file and send some data, there is no straightforward way to get it to work. There is an open issue in the JSON API specs for this. One possibility I have seen is to use multipart/related as shown here, but I think it's very hard to implement in DRF.
What I finally implemented was to send the request as form data: each file is sent as a file, and all other data as text.
To send the data as text, you can have a single key called data and send the whole JSON as a string in its value, as in the client sketch below.
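From the client side this could look like the following (a hedged sketch using the requests library; the URL and field names are illustrative):

import json
import requests

payload = {"caption": "hello", "tags": [1, 2]}

# the file goes as a regular multipart file part...
files = {"media": open("photo.jpg", "rb")}
# ...and everything else rides along in one "data" field as a JSON string
response = requests.post(
    "https://example.com/api/posts/",
    data={"data": json.dumps(payload)},
    files=files,
)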
Models.py
import uuid
from django.db import models

class Posts(models.Model):
    id = models.UUIDField(default=uuid.uuid4, primary_key=True, editable=False)
    caption = models.TextField(max_length=1000)
    media = models.ImageField(blank=True, default="", upload_to="posts/")
    tags = models.ManyToManyField('Tags', related_name='posts')
serializers.py -> no special changes needed; I'm not showing my serializer here as it's too lengthy because of the writable ManyToManyField implementation.
views.py
class PostsViewset(viewsets.ModelViewSet):
    serializer_class = PostsSerializer
    parser_classes = (MultipartJsonParser, parsers.JSONParser)
    queryset = Posts.objects.all()
    lookup_field = 'id'
You will need a custom parser, as shown below, for parsing the JSON.
utils.py
from django.http import QueryDict
import json
from rest_framework import parsers

class MultipartJsonParser(parsers.MultiPartParser):
    def parse(self, stream, media_type=None, parser_context=None):
        result = super().parse(
            stream,
            media_type=media_type,
            parser_context=parser_context
        )
        # find the "data" field and parse it
        data = json.loads(result.data["data"])
        qdict = QueryDict('', mutable=True)
        qdict.update(data)
        return parsers.DataAndFiles(qdict, result.files)
[The request example in Postman was shown as a screenshot in the original post.]
EDIT:
see this extended answer if you want to send each data as key value pair
I know this is an old thread, but I just came across this. I had to use MultiPartParser in order to get my file and extra data to come across together. Here's what my code looks like:
# views.py
class FileUploadView(views.APIView):
    parser_classes = (MultiPartParser,)

    def put(self, request, filename, format=None):
        file_obj = request.data['file']
        ftype = request.data['ftype']
        caption = request.data['caption']
        # ...
        # do some stuff with uploaded file
        # ...
        return Response(status=204)
My AngularJS code using ng-file-upload is:
file.upload = Upload.upload({
    url: "/api/picture/upload/" + file.name,
    data: {
        file: file,
        ftype: 'final',
        caption: 'This is an image caption'
    }
});
I send JSON and an image to create/update a product object. Below is a create APIView that works for me.
Serializer
class ProductCreateSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = [
            "id",
            "product_name",
            "product_description",
            "product_price",
        ]

    def create(self, validated_data):
        return Product.objects.create(**validated_data)
View
from rest_framework import generics, status
from rest_framework.parsers import FormParser, MultiPartParser

class ProductCreateAPIView(generics.CreateAPIView):
    queryset = Product.objects.all()
    serializer_class = ProductCreateSerializer
    permission_classes = [IsAdminOrIsSelf,]
    parser_classes = (MultiPartParser, FormParser,)

    def perform_create(self, serializer, format=None):
        owner = self.request.user
        if self.request.data.get('image') is not None:
            product_image = self.request.data.get('image')
            serializer.save(owner=owner, product_image=product_image)
        else:
            serializer.save(owner=owner)
Example test:
def test_product_creation_with_image(self):
    url = reverse('products_create_api')
    self.client.login(username='testaccount', password='testaccount')
    data = {
        "product_name": "Potatoes",
        "product_description": "Amazing Potatoes",
        "image": open("local-filename.jpg", "rb")
    }
    response = self.client.post(url, data)
    self.assertEqual(response.status_code, status.HTTP_201_CREATED)
@Nithin's solution works, but essentially it means you are sending JSON as strings and hence not using actual application/json inside the multipart segments.
What we want is to make the backend accept data in the format below:
------WebKitFormBoundaryrga771iuUYap8BB2
Content-Disposition: form-data; name="file"; filename="1x1_noexif.jpeg"
Content-Type: image/jpeg
------WebKitFormBoundaryrga771iuUYap8BB2
Content-Disposition: form-data; name="myjson"; filename="blob"
Content-Type: application/json
{"hello":"world"}
------WebKitFormBoundaryrga771iuUYap8BB2
Content-Disposition: form-data; name="isDownscaled"; filename="blob"
Content-Type: application/json
false
------WebKitFormBoundaryrga771iuUYap8BB2--
MultiPartParser works with the above format but will treat those JSON parts as files. So we simply unmarshal those JSON blobs by moving them into the data.
parsers.py
from rest_framework import parsers

class MultiPartJSONParser(parsers.MultiPartParser):
    def parse(self, stream, *args, **kwargs):
        data = super().parse(stream, *args, **kwargs)
        # Any 'file' found having application/json as its type will be moved to data
        mutable_data = data.data.copy()
        unmarshaled_blob_names = []
        json_parser = parsers.JSONParser()
        for name, blob in data.files.items():
            if blob.content_type == 'application/json' and name not in data.data:
                mutable_data[name] = json_parser.parse(blob)
                unmarshaled_blob_names.append(name)
        for name in unmarshaled_blob_names:
            del data.files[name]
        data.data = mutable_data
        return data
settings.py
REST_FRAMEWORK = {
    ..
    'DEFAULT_PARSER_CLASSES': [
        ..
        'myproject.parsers.MultiPartJSONParser',
    ],
}
This should work now.
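Client-side, a JSON-typed part can be produced with a Blob in JavaScript, or, as a hedged sketch here with the requests library (URL and field names are illustrative), with a (filename, content, content-type) tuple:

import json
import requests

files = {
    "file": open("1x1_noexif.jpeg", "rb"),
    # a part sent with Content-Type: application/json, which
    # MultiPartJSONParser will unmarshal into the request data
    "myjson": ("blob", json.dumps({"hello": "world"}), "application/json"),
}
requests.post("https://example.com/files/", files=files)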
The final bit is testing. Since the test client that ships with Django and REST framework doesn't support multipart JSON, we work around that by wrapping any JSON data:
import io
import json

def JsonBlob(obj):
    stringified = json.dumps(obj)
    blob = io.StringIO(stringified)
    blob.content_type = 'application/json'
    return blob

def test_simple(client, png_3x3):
    response = client.post(f'http://localhost/files/', {
        'file': png_3x3,
        'metadata': JsonBlob({'lens': 'Sigma 35mm'}),
    }, format='multipart')
    assert response.status_code == 200
If you're getting an error along the lines of Incorrect type. Expected pk value, received list. with @Nithin's solution, it's because Django's QueryDict is getting in the way: it's specifically structured to use a list for each entry in the dictionary, and thus:
{ "list": [1, 2] }
when parsed by MultipartJsonParser yields
{ 'list': [[1, 2]] }
which trips up your serializer.
Here is an alternative which handles this case, specifically expecting the _data key for your JSON:
from rest_framework import parsers
import json

class MultiPartJSONParser(parsers.MultiPartParser):
    def parse(self, stream, *args, **kwargs):
        data = super().parse(stream, *args, **kwargs)
        json_data_field = data.data.get('_data')
        if json_data_field is not None:
            # '_data' arrived as an ordinary form field: parse the string
            parsed = json.loads(json_data_field)
            mutable_data = {}
            for key, value in parsed.items():
                mutable_data[key] = value
            mutable_files = {}
            for key, value in data.files.items():
                if key != '_data':
                    mutable_files[key] = value
            return parsers.DataAndFiles(mutable_data, mutable_files)
        json_data_file = data.files.get('_data')
        if json_data_file:
            # '_data' arrived as an application/json file part: parse the blob
            parsed = parsers.JSONParser().parse(json_data_file)
            mutable_data = {}
            for key, value in parsed.items():
                mutable_data[key] = value
            mutable_files = {}
            for key, value in data.files.items():
                mutable_files[key] = value
            return parsers.DataAndFiles(mutable_data, mutable_files)
        return data
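Client-side, that means packing everything except the files into a single _data field; a brief sketch with requests (URL and field names illustrative):

import json
import requests

requests.post(
    "https://example.com/api/products/",
    data={"_data": json.dumps({"product_name": "Potatoes", "list": [1, 2]})},
    files={"image": open("local-filename.jpg", "rb")},
)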
It is very simple to use a multipart POST and a regular view, if that is an option: you send the JSON as a field and the files as files, then process everything in one view.
Here is a simple Python client and a Django server:
Client - sending multiple files and an arbitrary json-encoded object:
import json
import requests

payload = {
    "field1": 1,
    "manifest": "special cakes",
    "nested": {"arbitrary": 1, "object": [1, 2, 3]},
    "hello": "world"}
filenames = ["file1", "file2"]
request_files = {}
url = "http://example.com/upload"
for filename in filenames:
    request_files[filename] = open(filename, 'rb')
r = requests.post(url, data={'json': json.dumps(payload)}, files=request_files)
Server - consuming the json and saving the files:
import json
import os

from django.conf import settings
from django.http import HttpResponse, HttpResponseNotFound, HttpResponseServerError
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def upload(request):
    if request.method == 'POST':
        data = json.loads(request.POST['json'])
        try:
            manifest = data['manifest']
            # process the json data
        except KeyError:
            return HttpResponseServerError("Malformed data!")
        dir = os.path.join(settings.MEDIA_ROOT, "uploads")
        os.makedirs(dir, exist_ok=True)
        for file in request.FILES:
            path = os.path.join(dir, file)
            if not os.path.exists(path):
                save_uploaded_file(path, request.FILES[file])
    else:
        return HttpResponseNotFound()
    return HttpResponse("Got json data")

def save_uploaded_file(path, f):
    with open(path, 'wb+') as destination:
        for chunk in f.chunks():
            destination.write(chunk)
I'd just like to add to @Pithikos's answer by modifying the parser to accept lists as well, in line with how DRF parses lists in serializers in utils/html#parse_html_list:
class MultiPartJSONParser(parsers.MultiPartParser):
    def parse(self, stream, *args, **kwargs):
        data = super().parse(stream, *args, **kwargs)
        # Any 'file' found having application/json as its type will be moved to data
        mutable_data = data.data.copy()
        unmarshaled_blob_names = []
        json_parser = parsers.JSONParser()
        for name, blob in data.files.items():
            if blob.content_type == 'application/json' and name not in data.data:
                parsed = json_parser.parse(blob)
                if isinstance(parsed, list):
                    # need to break it out into [0], [1] etc
                    for idx, item in enumerate(parsed):
                        mutable_data[name + f"[{str(idx)}]"] = item
                else:
                    mutable_data[name] = parsed
                unmarshaled_blob_names.append(name)
        for name in unmarshaled_blob_names:
            del data.files[name]
        data.data = mutable_data
        return data
The following code worked for me.
from typing import Dict

import requests
from django.core.files.uploadedfile import SimpleUploadedFile

with open(file_path, 'rb') as f:
    file = SimpleUploadedFile('Your-Name', f.read())

data: Dict[str, str] = {}  # other model fields, e.g. name or location
files: Dict[str, SimpleUploadedFile] = {'model_field_name': file}

requests.put(url, headers=headers, data=data, files=files)
requests.post(url, headers=headers, data=data, files=files)
'model_field_name' is the name of the FileField or ImageField in your Django model. You can pass other data, such as name or location, as usual via the data parameter.
Hope this helps.
This works for me:
class FileUpload(APIView):
    parser_classes = [MultiPartParser]
    authentication_classes = [JWTAuthentication]

    def post(self, request, filename, format=None):
        file = request.data['file']
        data = json.loads(request.POST['form'])
        # ... just do ...
Frontend part: example with fetch (Vue frontend):
const data = new FormData();                // creates a new FormData object
data.append("file", this.files);            // add your file to the form data
data.append('form', JSON.stringify(body));  // add your json
fetch(`https://endpoint/FileUpload/${body.nombre}`, {
    method: 'POST',
    body: data,
    headers: {Authorization: `Bearer ${accessToken}`}
})
I hope this helps.