I am using Python 3 and Scrapy.
I have a simple spider (shown below) for which I want to save response.url and response.text as items. I would like to save response.text as JSON (for viewing in Notepad++). Is there any way it can be saved with a nested structure, such as the one that appears in the native HTML of the page?
import scrapy

class Spider1(scrapy.Spider):
    name = "Spider1"
    allowed_domains = []
    start_urls = ['http://www.uam.es/']

    def parse(self, response):
        items = Spider1Item()
        items['url'] = response.url
        items['body'] = response.text
        yield items
EDIT:
Here is a snippet of the target structure I would like when exporting to JSON.
HTML snippet of target structure
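For what it's worth, a JSON export can only store response.text as one flat, escaped string value; the tag nesting of the page does not become nested JSON unless the HTML is parsed into a tree first. A quick stdlib illustration with stand-in values (the URL and markup below are just placeholders for response.url and response.text):

```python
import json

# Stand-in values for response.url and response.text from the spider above.
url = "http://www.uam.es/"
body = "<html><body><h1>Hola</h1></body></html>"

# The whole page ends up as a single escaped string inside the JSON record.
line = json.dumps({"url": url, "body": body})
print(line)
```

The markup survives a round trip through json.loads unchanged, but it stays one string, not a nested structure.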
I want to store the data from a list on a website page. If I run the commands
response.css('title::text').extract_first() and
response.css("article div#section-2 li::text").extract()
individually in the scrapy shell, they show the expected output.
Below is my code, which is not storing the data in JSON or CSV format:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "medical"
    start_urls = ['https://medlineplus.gov/ency/article/000178.html/']

    def parse(self, response):
        yield {
            'topic': response.css('title::text').extract_first(),
            'symptoms': response.css("article div#section-2 li::text").extract()
        }
I have tried to run this code using
scrapy crawl medical -o medical.json
You need to fix your URL, it is https://medlineplus.gov/ency/article/000178.htm and not https://medlineplus.gov/ency/article/000178.html/.
Also, and more importantly, you need to define an Item class and yield/return it from the parse() callback of your spider:
import scrapy

class MyItem(scrapy.Item):
    topic = scrapy.Field()
    symptoms = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    name = "medical"
    allowed_domains = ['medlineplus.gov']
    start_urls = ['https://medlineplus.gov/ency/article/000178.htm']

    def parse(self, response):
        item = MyItem()
        item["topic"] = response.css('title::text').extract_first()
        item["symptoms"] = response.css("article div#section-2 li::text").extract()
        yield item
I have a web page from which I take the RSS links. The links point to XML, and I would like to use the XMLFeedSpider functionality to simplify the parsing.
Is that possible?
This would be the flow:
GET example.com/rss (returns HTML)
Parse the HTML and get the RSS links
For each link, parse the XML
I found a simple way based on the existing example in the documentation and by looking at the source code. Here is my solution:
import scrapy
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def start_requests(self):  # must be start_requests, not start_request
        urls = ['http://www.example.com/get-feed-links']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        for el in response.css("li.feed-links"):
            yield scrapy.Request(el.css("a::attr(href)").extract_first(),
                                 callback=self.parse)

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
        item = TestItem()
        item['id'] = node.xpath('@id').extract()  # attribute XPath uses @, not #
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item
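The attribute-versus-child distinction in parse_node can be checked with the stdlib alone. A small sketch, assuming each feed entry looks roughly like `<item id="..."><name>...</name><description>...</description></item>` (the sample values are made up):

```python
import xml.etree.ElementTree as ET

# A stand-in for one <item> node from the feed.
item = ET.fromstring('<item id="42"><name>foo</name><description>bar</description></item>')

print(item.get("id"))          # attribute access; the Scrapy XPath equivalent is @id
print(item.findtext("name"))   # child element; plain element name in XPath
```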
I have a bunch of pandas DataFrames. I want to view them in HTML (and I also want the JSON). So, this is what I did:
masterDF = pd.concat([df1, df2, df3, ...])  # concatenate all dfs
masterDF.to_json(jsonFile, orient='records')  # this gives a valid JSON file, but in a list format
htmlStr = json2html.convert(json=jsonString)
htmlFile = write htmlStr to myFile.html
The json file looks like this:
[{"A":1458000000000,"B":300,"C":1,"sid":101,"D":323.4775570025,"score":0.0726},{"A":1458604800000,"B":6767,"C":1,"sid":101,"D":321.8098393263,"score":0.9524},{"A":1458345600000,"B":9999,"C":3,"sid":29987,"D":125.6096891766,"score":0.9874},{"A":1457827200000,"B":3110,"C":2,"sid":787623,"D":3010.9544668798,"score":0.0318}]
Problem I am facing:
DataFrame.to_json outputs a JSON file in [] format, like a list, and I am unable to load this JSON file like this:
with open(jsonFile) as json_data:
    js = json.load(json_data)
htmlStr = json2html.convert(json=js)
return htmlStr
Is there a way to load a JSON file like the above and convert it to HTML?
Why not use pandas.DataFrame.to_html? (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_html.html)
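A minimal sketch of that suggestion, using small stand-in frames since the original dfs aren't shown: DataFrame.to_html renders the table directly, so the JSON-to-HTML round trip isn't needed for viewing.

```python
import pandas as pd

# Stand-ins for the question's df1, df2, ... with a few of its column names.
df1 = pd.DataFrame({"A": [1458000000000], "sid": [101], "score": [0.0726]})
df2 = pd.DataFrame({"A": [1458604800000], "sid": [29987], "score": [0.9874]})
masterDF = pd.concat([df1, df2], ignore_index=True)

# Render the concatenated frame straight to an HTML table.
html = masterDF.to_html(index=False)
with open("myFile.html", "w") as f:
    f.write(html)
```

The JSON export can still be written separately with masterDF.to_json if both formats are wanted.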
From Strip HTML from strings in Python, I got help with this code:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # required in Python 3
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
For strip_tags(html), do I put an HTML file name as the parameter? I have a local HTML file called CID-Sync-0619.html at C:\Python34.
This is my code so far:
Extractfile = open("ExtractFile.txt", "w")
Extractfile.write(strip_tags(CID-Sync-0619.html))
The entire file is actually really long, but the rest is irrelevant to my question. I want to open another file and write the text extracted from the HTML file into that text file. How do I pass the HTML file as a parameter? Any help would be appreciated.
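A runnable sketch of one way to wire this up. strip_tags() expects the HTML text itself, not a file name, so the file has to be read first; a small stand-in file is created here since CID-Sync-0619.html isn't available.

```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

# Stand-in for CID-Sync-0619.html so the example runs anywhere.
with open("sample.html", "w", encoding="utf-8") as f:
    f.write("<html><body><p>hello <b>world</b></p></body></html>")

# Pass the file's *contents* (a string) to strip_tags(), not the file name.
with open("sample.html", encoding="utf-8") as f:
    text = strip_tags(f.read())

with open("ExtractFile.txt", "w", encoding="utf-8") as out:
    out.write(text)
```

For the real file, the stand-in block would be dropped and the open() call pointed at its path instead.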
I need to extract a part of the HTML from a given HTML page. So far, I use XmlSlurper with TagSoup to parse the HTML page and then try to get the needed part by using the StreamingMarkupBuilder:
import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
However, the result I get is
<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>
which looks great, but I would like to get it without the html namespace.
How do I avoid the namespace?
Turn off the namespace feature on the TagSoup parser. Example:
import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)