I'm new to both XPath and HTML so I'm probably missing something fundamental here. I have an HTML page from which I want to extract all the items shown below. (I'm using Scrapy to do my requests; I just need the proper XPath to get the data.)
I just want to loop through all these items and get some data from inside each one.
for item in response.xpath("//ul[@class='feedArticleList XSText']/li[@class='item']"):
    yield {'name': item.xpath("//div[@class='intro lhNormal']").get()}
The problem is that .get() only gives me the first item on every iteration of the loop. If I instead use .getall(), I get all the items on every iteration (which in my view shouldn't happen, since I thought I had selected only one item at a time in each iteration). Thanks in advance!
It seems you're missing a . in your XPath expression (to "indicate" you're working from a context node).
Replace:
yield {'name': item.xpath("//div[@class='intro lhNormal']").get()}
with:
yield {'name': item.xpath(".//div[@class='intro lhNormal']").get()}
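For context, a minimal sketch of the corrected loop inside a Scrapy callback (parse is the default Scrapy callback name; the selectors and field name are taken from the question):

def parse(self, response):
    # Each 'item' is a selector scoped to a single <li class="item">
    for item in response.xpath("//ul[@class='feedArticleList XSText']/li[@class='item']"):
        # The leading '.' keeps the search relative to the current <li>;
        # a bare '//div' restarts the search from the whole document on
        # every iteration, which is why only the first match came back.
        yield {'name': item.xpath(".//div[@class='intro lhNormal']").get()}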
You are indeed missing something fundamental: plain Python does not have an xpath() function by default. Outside of Scrapy you would be better off using the bs4 or lxml libraries.
See an example with lxml:
import lxml.html
from itertools import islice  # needed for islice below

# Parse the page directly from the URL
doc = lxml.html.parse('http://www.websters-online-dictionary.org')
table = []
trs = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
for tr in islice(trs, 3):  # first three rows only
    for td in tr.xpath('td'):
        # Relative paths search within the current cell; an absolute
        # path like '/b' would start from the document root instead.
        table += td.xpath("b/text() | text()")
buffer = ''.join(table)
The full explanation is here.
I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd

def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    # GATHERING THE DATA FROM THE API
    pkmn = []
    for id_ in range(x, y + 1):
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent=4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'],
                         pages['stats'], pages['types']])

    # MAKING A DATAFRAME FROM THE GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
    return df
The code works fine for most pokemon. However, when I run poke_scrape(229, 229) (trying to load ONLY the 229th pokemon), it raises a JSONDecodeError.
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue has occurred with another ID - otherwise I could just manually enter the stats for the specific pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for each pokemon only load when the URL ends with a '/' (for example, https://pokeapi.co/api/v2/pokemon/229/ works while https://pokeapi.co/api/v2/pokemon/229 returns not found). Others, however, respond with an error because of the added '/', so I fixed the issue with a few if statements right after the for loop at the beginning of the function.
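A minimal sketch of that workaround, assuming a retry helper (the function name and the try-both-forms logic are illustrative, not the answerer's exact code):

import requests

def fetch_pokemon(id_):
    # Try the URL without a trailing slash first, then retry with one,
    # since which form works varies between PokeAPI entries (assumption
    # based on the behaviour described above).
    base = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
    for url in (base, base + '/'):
        resp = requests.get(url)
        if resp.status_code == 200:
            try:
                return resp.json()
            except ValueError:  # JSONDecodeError is a subclass of ValueError
                continue
    return None  # neither form returned valid JSON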
I have a chunk of json that has the following format:
{"page":{"size":7,"number":1,"totalPages":1,"totalElements":7,"resultSetId":null,"duration":0},"content":[{"id":"787edc99-e94f-4132-b596-d04fc56596f9","name":"Verification","attributes":{"ruleExecutionClass":"VerificationRule"},"userTags":[],"links":[{"rel":"self","href":"/endpoint/787edc99-e94f-4132-b596-d04fc56596f9","id":"787edc99-e94f-...
Basically the size attribute (in this case) tells me that there are 7 parts to the content section. How do I convert this chunk of json to an array in Perl, and can I do it using the size attribute? Or is there a simpler way like just using decode_json()?
Here is what I have so far:
my $resources = get_that_json_chunk(); # function returns exactly the json you see, except with all 7 resources in the content section
my @decoded_json = @$resources;
foreach my $resource (@decoded_json) {
    # handle the data
}
I've also tried something like this:
my $deserialize = from_json( $resources );
my @decoded_json = (@{$deserialize});
I want to iterate over the array and handle the data. I've tried a few different ways because I read a little about array refs, but I keep getting "Not an ARRAY reference" errors and "Can't use string ("{"page":{"size":7,"number":1,"to"...) as an ARRAY ref while "strict refs" in use"
Thank you to Matt Jacob. The top level of that JSON is a hash reference, not an array reference, which is why dereferencing it as an array failed; the array you want lives under the content key:

use JSON;

my $deserialized = decode_json($resources);
print "$_->{id}\n" for @{$deserialized->{content}};
I have a bunch of pandas dataframes in a list that I need to convert to html tables. The html code for each individual dataframe looks good, however when I append the html to a list I end up with a bunch of \n characters showing on my webpage. Can anyone tell me how to get rid of them?
python code:
dataframe_html = []
table_dic = {}
for df in dataframes:
    frame = df.to_html(classes="table table-hover")
    dataframe_html.append(frame)  # this is the line where all the \n get added
table_dic.update({'dataframe_html': dataframe_html})
return render(request, 'InterfaceApp/FileProcessor_results.html', table_dic)
html code:
<div class="table-responsive">
{{ dataframe_html | safe }}
</div>
It shows up on the page as the raw Python list, with all the \n characters visible.
Can anyone help me out with this?
To display 3 separate tables, join the list of HTML strings into a single string:
dataframe_html = u''.join(dataframe_html)
In context (dataframe_html still starts out as the empty list from the question):
dataframe_html = []
for df in dataframes:
    frame = df.to_html(classes="table table-hover")
    dataframe_html.append(frame)

dataframe_html = u''.join(dataframe_html)
table_dic = {'dataframe_html': dataframe_html}
return render(request, 'InterfaceApp/FileProcessor_results.html', table_dic)
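An alternative, assuming the Django template from the question, is to keep the list and loop over it in the template instead of joining in the view (a sketch, not the answerer's code):

<div class="table-responsive">
  {% for table in dataframe_html %}
    {{ table | safe }}
  {% endfor %}
</div>

Either way the key point is the same: render each HTML string itself, not the Python list that contains it.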
FWIW this late in the game, I originally had:
response = render_template('table_display.html', query_results=[df_html], query_name='Quality Item Query')
and was getting an entire row of \n characters. Changing it to the following made the newlines disappear.
response = render_template('table_display.html', query_results=df_html, query_name='Quality Item Query')
Even later in the game...
I stumbled upon this thread when having the same problem. Hopefully, the below helps someone.
I assigned the result of df.to_html() to a (nested) list and got the newlines when rendering in my Jinja template. The solution, inspired by @AlliDeacon, was to index the result again when rendering.
Python code:
result[0][0] = df.to_html()
Jinja template:
<div>Table: {{ result[0][0][0] }}</div>
Refer to the output of the following code for an illustration of the difference between a (sub-)list and a list element:
df = pd.DataFrame([['a', 'b']],
                  columns=['col_A', 'col_B'])
tmp = []
tmp.append(df.to_html(index=False))
print(tmp)
print(tmp[0])
Result: print(tmp) shows the list with literal \n escape sequences, while print(tmp[0]) renders the actual HTML:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>col_A</th>
      <th>col_B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>a</td>
      <td>b</td>
    </tr>
  </tbody>
</table>
Even later than late to the game. The issue with the newline characters is that the frame is passed from your app to Jinja as a JSON string by default. The \n's are read literally, but the table gets constructed from what follows them inside the JSON string. There are two ways to deal with this, depending on how you pass your dataframe:
Case 1: Passing a dataframe with render_template:
Pass the frame from your app without calling df.to_html. Within your HTML, use Jinja2 syntax to obtain a clean frame without any newline characters:
{{ df_html.to_html(classes="your-fancy-table-class") | safe }}
Case 2: Passing with json.dumps to be retrieved in JS
In case you send your frame via a response, e.g. to a POST Ajax request, pass the frame with df.to_html() as you did:
@app.route('/foo', methods=['POST'])  # load and return the frame route
def foo():
    # df = pd.DataFrame...
    return json.dumps({"response": df.to_html()})
Then in your JS, load the clean frame (the HTML without the newline characters) from your response with:
var response = JSON.parse(data).response;
$("#your-table-wrapper").html(response);
I have several URLs that I want to open to a specific place and search for a specific name, but I'm only getting None or [] returned. I have searched but cannot see an answer that is pertinent to my code.
from bs4 import BeautifulSoup
from urllib import request

webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")
Links = []
for line in incidents.find_all('a'):
    Links.append("http://www.dsfire.gov.uk/News/Newsdesk/" + line.get('href'))
n = 0
e = len(Links)
while n < e:
    webpage = request.urlopen(Links[n])
    soup = BeautifulSoup(webpage)
    station = soup.find(id="IncidentDetailContainer")
    # search string
    print(soup.body.findAll(text='Ashburton'))
    n = n + 1
I know it's in the last link found on the page. Thanks in advance for any ideas or comments.
If your output is only "[]", it means the result is an array (a list). You then have to index into it: variable[index].
Try this one:
print(soup.body.findAll(text='Ashburton')[0])
...though storing the result in a variable first would be easier:
search = soup.body.findAll(text='Ashburton')
print(search[0])
This will give you the first found item.
For printing all found items you could go:
search = soup.body.findAll(text='Ashburton')
for entry in search:
    print(entry)
(I really don't know BeautifulSoup, so treat this as a rough sketch rather than a polished example.)
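Since the question already grabs the incident detail container but never uses it, a sketch that searches inside it and reports which page matched might look like this (the "html.parser" argument and the loop shape are assumptions, not the asker's code):

for link in Links:
    soup = BeautifulSoup(request.urlopen(link), "html.parser")
    station = soup.find(id="IncidentDetailContainer")
    # Search only within the incident detail div instead of the whole body
    if station and station.find(text='Ashburton'):
        print("Found 'Ashburton' at", link)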
I am a new learner of Scrapy and have encountered a problem. I get several JSON responses when crawling websites (that part I have already done). I want to fill them into items and then output everything to one JSON file. But the output file is not what I expected.
The item class looks like this:
class USLPlayer(scrapy.Item):
    ln = scrapy.Field()
    fn = scrapy.Field()
    ...
The original json file structure looks like this:
{"players":{"4752569":{"ln":"Musa","fn":"Yahaya", .... ,"apprvd":"59750"}, "4801435":{"ln":"Ackley","fn":"Brian", ... ,"apprvd":"59750"}, ...}}
The expected result I hope to be looks like this:
{"item" :{"ln":"Musa","fn":"Yahaya", .... ,"apprvd":"59750"}},{"item": {"ln":"Ackley","fn":"Brian", ... ,"apprvd":"59750"}, ...
Basically, I hope every item will be a separate entry in a list.
The code that fills the items is:
players = json.loads(plain_text)
for id, player in players["players"].items():
    item = USLPlayer()
    for key, value in player.items():
        item[key] = value
    yield item
Is there any way I can output the JSON file as I expected? Thank you very much for any kind answer.
Have you tried the JSON Lines feed exporter?
It will output your items as JSON objects, one per line. Then, reading the list of players back from the file is as easy as calling json.loads on each line.
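A minimal sketch, assuming the spider is named uslplayers (the spider and file names are placeholders; the .jl extension selects the JSON Lines exporter):

scrapy crawl uslplayers -o players.jl

Then read the items back:

import json

with open('players.jl') as f:
    players = [json.loads(line) for line in f]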