How to scrape the text by categories and make a JSON file?

We are scraping the website www.theft-alerts.com and currently get all of the text:

import json
import urllib2
from bs4 import BeautifulSoup

connection = urllib2.urlopen('http://www.theft-alerts.com')
soup = BeautifulSoup(connection.read().replace("<br>", "\n"), "html.parser")

theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    for wd in sp.select("div.itemindentmodified"):
        text = wd.text
        if not text.startswith("Images :"):
            print(text)

with open("theft-alerts.json", 'w') as outFile:
    json.dump(theftalerts, outFile, indent=2)
Output:
STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Item : The bell has a diameter of 37 1/2" is approx 3' tall weighs just shy of half a ton and was made by Taylor's of Loughborough in 1902. It is stamped with the numbers 232 and 11.
The bell had come from Co-operative Wholesale Society's Crumpsall Biscuit Works in Manchester.
Any info to : PC 2361. Tel 0300 333 3000
Messages : Send a message
Crime Ref : 22EJ / 50213D-14
No of items stolen : 1
Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377
User : 1 ; Antique/Reclamation/Salvage Trade ; (Administrator)
Date Created : 11 Aug 2014 15:27:57
Date Modified : 11 Aug 2014 15:37:21;
How can we categorize the text for the JSON file? The JSON file is currently empty.
Output JSON:
[]

You can define a list and append each dictionary object that you create to the list, e.g.:

import json

theftalerts = []

atheftobject = {}
atheftobject['location'] = 'UK > Hereford & Worcs'
atheftobject['category'] = 'Shop, Pub, Church, Telephone Boxes & Bygones'
theftalerts.append(atheftobject)

# Create a new dict for the second record; reusing the first one
# would mutate the entry already stored in the list.
atheftobject = {}
atheftobject['location'] = 'UK'
atheftobject['category'] = 'Shop'
theftalerts.append(atheftobject)

with open("theft-alerts.json", 'w') as outFile:
    json.dump(theftalerts, outFile, indent=2)
After this run, theft-alerts.json will contain this JSON array:
[
  {
    "location": "UK > Hereford & Worcs",
    "category": "Shop, Pub, Church, Telephone Boxes & Bygones"
  },
  {
    "location": "UK",
    "category": "Shop"
  }
]
You can play with this to generate your own JSON objects.
Check out the json module for more details.

Your JSON output remains empty because your loop doesn't append to the list.
Here's how I would extract the category name:
theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    item_text = "\n".join(
        [wd.text for wd in sp.select("div.itemindentmodified")
         if not wd.text.startswith("Images :")])
    category = sp.find(
        'span', {'class': 'itemsmall'}).text.split('\n')[1][11:]
    theftalerts.append({'text': item_text, 'category': category})
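If you then want the JSON keyed by category rather than stored as a flat list, a minimal sketch (assuming the theftalerts list built above) could group the entries with a defaultdict before dumping:

import json
from collections import defaultdict

# Hypothetical follow-up: group the scraped alerts by their category,
# building on the theftalerts list from the loop above.
by_category = defaultdict(list)
for alert in theftalerts:
    by_category[alert['category']].append(alert['text'])

with open("theft-alerts.json", "w") as outFile:
    json.dump(by_category, outFile, indent=2)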

Related

pandas json normalize key error with a particular json attribute

I have a JSON as:
mytestdata = {
    "success": True,
    "message": "",
    "data": {
        "totalCount": 95,
        "goal": [
            {
                "user_id": 123455,
                "user_email": "john.smith@test.com",
                "user_first_name": "John",
                "user_last_name": "Smith",
                "people_goals": [
                    {
                        "goal_id": 545555,
                        "goal_name": "test goal name",
                        "goal_owner": "123455",
                        "goal_narrative": "",
                        "goal_type": {
                            "id": 1,
                            "name": "Team"
                        },
                        "goal_create_at": "1595874095",
                        "goal_modified_at": "1595874095",
                        "goal_created_by": "123455",
                        "goal_updated_by": "123455",
                        "goal_start_date": "1593561600",
                        "goal_target_date": "1601424000",
                        "goal_progress": "34",
                        "goal_progress_color": "#ff9933",
                        "goal_status": "1",
                        "goal_permission": "internal,team",
                        "goal_category": [],
                        "goal_owner_full_name": "John Smith",
                        "goal_team_id": "766754",
                        "goal_team_name": "",
                        "goal_workstreams": []
                    }
                ]
            }
        ]
    }
}
I am trying to display all details in "people_goals" along with "user_last_name", "user_first_name", "user_email", and "user_id" using json_normalize.
So far I am able to display "people_goals", "user_first_name", and "user_email" with this code:

df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
                        meta=[['goal', 'user_first_name'], ['goal', 'user_last_name'],
                              ['goal', 'user_email']], errors='ignore')

However, I run into an issue when trying to include ['goal', 'user_id'] in meta=[].
The error is:
TypeError Traceback (most recent call last)
<ipython-input-192-b7a124a075a0> in <module>
7 df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
8 meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email'], ['goal','user_id']],
----> 9 errors='ignore')
10
11 # df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'])
The only difference I see for 'user_id' is that it is not a string.
Am I missing something here?
Your code works on my platform. I've moved away from using the record_path and meta parameters for two reasons: (a) they are difficult to work out, and (b) there are compatibility issues between versions of pandas.
Instead, I now call json_normalize() multiple times to progressively expand the JSON, or use pd.Series. Both approaches are included as examples below.
# Approach 1: progressively expand the nested JSON with explode() and pd.Series
df = pd.json_normalize(data=mytestdata['data']).explode("goal")
df = pd.concat([df, df["goal"].apply(pd.Series)], axis=1).drop(columns="goal").explode("people_goals")
df = pd.concat([df, df["people_goals"].apply(pd.Series)], axis=1).drop(columns="people_goals")
df = pd.concat([df, df["goal_type"].apply(pd.Series)], axis=1).drop(columns="goal_type")
df.T

# Approach 2: nested json_normalize() calls, re-serialising between passes
df2 = pd.json_normalize(pd.json_normalize(
    pd.json_normalize(data=mytestdata['data']).explode("goal").to_dict(orient="records")
).explode("goal.people_goals").to_dict(orient="records"))
df2.T

print(df.T.to_string())
output
0
totalCount 95
user_id 123455
user_email john.smith@test.com
user_first_name John
user_last_name Smith
goal_id 545555
goal_name test goal name
goal_owner 123455
goal_narrative
goal_create_at 1595874095
goal_modified_at 1595874095
goal_created_by 123455
goal_updated_by 123455
goal_start_date 1593561600
goal_target_date 1601424000
goal_progress 34
goal_progress_color #ff9933
goal_status 1
goal_permission internal,team
goal_category []
goal_owner_full_name John Smith
goal_team_id 766754
goal_team_name
goal_workstreams []
id 1
name Team
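If the version-dependent TypeError really is triggered by the numeric user_id (the only meta field in the sample that is not a string), one possible workaround, sketched here as an assumption rather than a confirmed fix, is to cast it to a string before normalizing:

import copy
import pandas as pd

# Hypothetical workaround: cast the numeric user_id to a string first,
# since it is the only meta field in the JSON that is not a string.
fixed = copy.deepcopy(mytestdata)
for g in fixed['data']['goal']:
    g['user_id'] = str(g['user_id'])

df3 = pd.json_normalize(data=fixed['data'], record_path=['goal', 'people_goals'],
                        meta=[['goal', 'user_id'], ['goal', 'user_first_name'],
                              ['goal', 'user_last_name'], ['goal', 'user_email']],
                        errors='ignore')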

correct format for json file...then to dataframe

I have a Notepad file which I save as a JSON file, and I'm trying to read it into a pandas dataframe.
My JSON file looks like this:
{
    "date" : "2000-01-01",
    "i" : "1387",
    "xxx" : "aaaa",
},
{
    "fecha" : "2000-01-02",
    "indicativo" : "1387",
    "xxx" : "aaaa",
},
{
    "data" : "2000-01-03",
    "indicativo" : "1387",
},
{
    "date" : "2000-01-04",
    "i" : "1387",
    "xxx" : "aaaa",
},
{
    "fecha" : "2000-01-05",
    "indicativo" : "1387",
    "xxx" : "aaaa",
}
How can I change this to the correct JSON format using code? (Keep in mind that I just posted a few lines; the actual JSON file is hundreds and hundreds of lines, so it is impractical for me to fix it manually.)
And then once I have that file, the code would be:

import pandas as pd
from pandas.io.json import json_normalize

name = pd.read_json(r"file.json", lines=True, orient='records')

I tried running the above code with the JSON file but kept getting:
ValueError: Expected object or value
After much trial and error, I believe it is because the file is not in correct JSON format, so I would appreciate it if someone could help me with at least the first part.
The question asks: how can I change this to the correct JSON format using code?
What's shown in the file is rows of comma- and newline-separated dictionaries.
Read and fix the file by adding [ to the beginning of the file and ] to the end.
Once the file is fixed, it doesn't need to be fixed again.
Then read the file back in with pandas.read_json.
The list of dictionaries can be loaded into pandas, but each dict uses different keys, so some additional cleaning may be necessary.
import pandas as pd
from pathlib import Path

# path to file
p = Path('e:/PythonProjects/stack_overflow/test.json')

# read and fix the file
with p.open('r+') as f:
    file = f.read()          # read the file in as one long string
    file = '[' + file + ']'  # add brackets to the beginning and end of the string
    f.seek(0)                # go back to the beginning of the file
    f.write(file)            # write the new data back to the file
    f.truncate()             # remove any leftover old data

# after fixing the file with code
df = pd.read_json(p)

# display(df)
date i xxx fecha indicativo data
0 2000-01-01 1387 aaaa NaN NaN NaN
1 NaN NaN aaaa 2000-01-02 1387 NaN
2 NaN NaN NaN NaN 1387 2000-01-03
3 2000-01-04 1387 aaaa NaN NaN NaN
4 NaN NaN aaaa 2000-01-05 1387 NaN
I think your JSON file should have [ and ] at the beginning and end.
Your data file is a list of dictionaries with the opening and closing square brackets missing. (As it stands, the file is not valid JSON: it holds several top-level dictionaries with nothing enclosing them.)
The answer above shows how to add the '[' and ']'.
After you do this, you can call the DataFrame constructor directly:
import pandas as pd

data = [
    {
        "date" : "2000-01-01",
        "i" : "1387",
        "xxx" : "aaaa",
    },
    {
        "fecha" : "2000-01-02",
        "indicativo" : "1387",
        "xxx" : "aaaa",
    },
    # remaining dictionaries omitted to save space
]

pd.DataFrame(data)
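Since the sample mixes English and Spanish keys (date/fecha/data and i/indicativo), a further cleanup sketch, assuming only the column names shown in the sample, could collapse the synonym columns after loading:

df = pd.DataFrame(data)
# Collapse the synonym columns into one, taking the first non-NaN value
# from left to right (hypothetical cleanup for the columns in this sample).
df['date'] = df[['date', 'fecha', 'data']].bfill(axis=1)['date']
df['i'] = df[['i', 'indicativo']].bfill(axis=1)['i']
df = df.drop(columns=['fecha', 'data', 'indicativo'])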

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models, and I figured I'd do it with something I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value from a JSON file.
The dataset I'm using comes from Kaggle. You can grab it and run this yourself:
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print df
gameDuration t1_champ1id
0 1949 1
1 1851 2
2 1493 1
3 1758 1
4 2094 2
I want to replace the ID in t1_champ1id with the lookup value in the json.
If both of these were dataframes, then I could use the merge option.
This is what I've tried. I don't know if this is the best way to read in the json file.
import pandas

df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, ('t1_champ1id')] == i:
            df.loc[j, ('t1_champ1id')] = champ[0][i]['name']
I get the below error:
'the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary, and then map it to df['t1_champ1id'].
player_names = {v['id']:v['name'] for v in json_file['data'].itervalues()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
# gameDuration t1_champ1id
# 0 1949 Annie
# 1 1851 Olaf
# 2 1493 Annie
# 3 1758 Annie
# 4 2094 Olaf
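Note that .itervalues() is Python 2 only; on Python 3, a minimal equivalent sketch (assuming the champion_info.json layout shown above and the same df) would be:

import json

with open('champion_info.json') as f:
    champ_json = json.load(f)

# Build an id -> name lookup; .values() replaces Python 2's .itervalues()
player_names = {v['id']: v['name'] for v in champ_json['data'].values()}
df['t1_champ1id'] = df['t1_champ1id'].map(player_names)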
Alternatively, create a dataframe from the 'data' in the JSON file (transpose the result and set the index to the field you want to map on, the id), then map that onto the original df:

import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)

df['champ_name'] = df.t1_champ1id.map(champs['name'])

R: getting google finance JSON data into a dataframe

I am trying to get google finance JSON data into a dataframe.
I tried:
library(jsonlite)
dat1 <- fromJSON("http://www.google.com/finance/info?q=NSE:%20AAPL,MSFT,TSLA,AMZN,IBM")
dat1
However I get an error:
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
Thank you for any help.
I could not replicate your error using fromJSON due to proxy issues on my side, but the following works using httr:
require(jsonlite)
require(httr)
#Set your proxy setting if needed
#set_config(use_proxy(url='hostname',port= port,username="",password=""))
url.name = "http://www.google.com/finance/info?q=NSE:%20AAPL,MSFT,TSLA,AMZN,IBM"
url.get = GET(url.name)
#parsing the content as json results in similar error as you encountered
#url.content = content(url.get,type="application/json")
#Error in parseJSON(txt) : parse error: trailing garbage
# " : "0.57" ,"yld" : "2.46" } ,{ "id": "358464" ,"t" : "MSFT"
# (right here) ------^
#read content as html text
url.content = content(url.get, as="text")
#remove html tags
clean.text = gsub("<.*?>", "", url.content)
#remove residual text
clean.text = gsub("\\n|\\//","",clean.text)
DF = fromJSON(clean.text)
head(DF[,1:10],5)
# id t e l l_fix l_cur s ltt lt lt_dts
#1 22144 AAPL NASDAQ 92.51 92.51 92.51 1 4:00PM EDT May 11, 4:00PM EDT 2016-05-11T16:00:02Z
#2 358464 MSFT NASDAQ 51.05 51.05 51.05 1 4:00PM EDT May 11, 4:00PM EDT 2016-05-11T16:00:02Z
#3 12607212 TSLA NASDAQ 208.96 208.96 208.96 1 4:00PM EDT May 11, 4:00PM EDT 2016-05-11T16:00:02Z
#4 660463 AMZN NASDAQ 713.23 713.23 713.23 1 4:00PM EDT May 11, 4:00PM EDT 2016-05-11T16:00:02Z
#5 18241 IBM NYSE 148.95 148.95 148.95 2 6:59PM EDT May 11, 6:59PM EDT 2016-05-11T18:59:12Z
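For comparison, here is a minimal Python sketch of the same clean-then-parse idea (Python 2, to match the rest of this thread, and assuming the endpoint still responded as it did in 2016; it has since been retired):

import json
import urllib2

url = "http://www.google.com/finance/info?q=NSE:%20AAPL,MSFT,TSLA,AMZN,IBM"
raw = urllib2.urlopen(url).read()

# The old endpoint prefixed the JSON array with "//", which breaks parsers;
# strip everything before the first '[' and parse what remains.
clean = raw[raw.index('['):]
quotes = json.loads(clean)
print(quotes[0]['t'], quotes[0]['l'])  # ticker and last price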
I got the below code from here. Let me know if this helps. On a side note, I would also recommend Netfonds. It is the only source I've found that provides intraday tick-level data for both historical prices and the open book. I posted some additional links below for pulling Netfonds data if you're interested.
http://www.blackarbs.com/blog/3/22/2015/how-to-get-free-intraday-stock-data-from-netfonds
http://www.onestepremoved.com/free-stock-data/
import urllib
from datetime import date, datetime

""" googlefinance
This module provides a Python API for retrieving stock data from Google Finance.
"""

_month_dict = {
    'Jan': 1,
    'Feb': 2,
    'Mar': 3,
    'Apr': 4,
    'May': 5,
    'Jun': 6,
    'Jul': 7,
    'Aug': 8,
    'Sep': 9,
    'Oct': 10,
    'Nov': 11,
    'Dec': 12}

# Google doesn't like Python's user agent...
class FirefoxOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'

def __request(symbol):
    url = 'http://google.com/finance/historical?q=%s&output=csv' % symbol
    opener = FirefoxOpener()
    return opener.open(url).read().strip().strip('"')

def get_historical_prices(symbol, start_date=None, end_date=None):
    """
    Get historical prices for the given ticker symbol.
    Returns a nested list. Fields are Date, Open, High, Low, Close, Volume.
    """
    price_data = [data.split(',') for data in __request(symbol).split('\n')[1:]]
    for quote in price_data:
        quote[0] = _format_date(quote[0])
    return price_data

def _format_date(datestr):
    """ Change datestr from Google's format ('20-Jul-12') to the format Yahoo uses ('2012-07-20') """
    parts = datestr.split('-')
    day = int(parts[0])
    month = _month_dict[parts[1]]
    year = int('20' + parts[2])
    return date(year, month, day).strftime('%Y-%m-%d')
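A hypothetical usage line, purely illustrative since Google has long since retired this CSV endpoint:

# Illustrative only: the endpoint no longer responds.
rows = get_historical_prices('AAPL')
for row in rows[:3]:
    print(row)  # each row is [Date, Open, High, Low, Close, Volume]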
If the Google Finance endpoint returned newline-delimited JSON, the solution in R would be:
library(jsonlite)
dat1 <- stream_in(url("http://www.google.com/finance/info?q=NSE:%20AAPL,MSFT,TSLA,AMZN,IBM"))
But it seems the endpoint is not accepting such requests (any more?):
HTTP status was '403 Forbidden'

Parsing JSON from Google Distance Matrix API with Corona SDK

So I'm trying to pull data from a JSON string (as seen below). When I decode the JSON using the code below, and then attempt to index the duration text, I get a nil return. I have tried everything and nothing seems to work.
Here is the Google Distance Matrix API JSON:
{
    "destination_addresses" : [ "San Francisco, CA, USA" ],
    "origin_addresses" : [ "Seattle, WA, USA" ],
    "rows" : [
        {
            "elements" : [
                {
                    "distance" : {
                        "text" : "1,299 km",
                        "value" : 1299026
                    },
                    "duration" : {
                        "text" : "12 hours 18 mins",
                        "value" : 44303
                    },
                    "status" : "OK"
                }
            ]
        }
    ],
    "status" : "OK"
}
And here is my code:
local json = require("json")
local http = require("socket.http")

local myNewData1 = {}
local SaveData1 = function (event)
    distanceReturn = ""
    distance = ""
    local URL1 = "http://maps.googleapis.com/maps/api/distancematrix/json?origins=Seattle&destinations=San+Francisco&mode=driving&&sensor=false"
    local response1 = http.request(URL1)
    local data2 = json.decode(response1)
    if response1 == nil then
        native.showAlert("Data is nill", { "OK" })
        print("Error1")
        distanceReturn = "Error1"
    elseif data2 == nill then
        distanceReturn = "Error2"
        native.showAlert("Data is nill", { "OK" })
        print("Error2")
    else
        for i = 1, #data2 do
            print("Working")
            print(data2[i].rows)
            for j = 1, #data2[i].rows, 1 do
                print("\t" .. data2[i].rows[j])
                for k = 1, #data2[i].rows[k].elements, 1 do
                    print("\t" .. data2[i].rows[j].elements[k])
                    for g = 1, #data2[i].rows[k].elements[k].duration, 1 do
                        print("\t" .. data2[i].rows[k].elements[k].duration[g])
                        for f = 1, #data2[i].rows[k].elements[k].duration[g].text, 1 do
                            print("\t" .. data2[i].rows[k].elements[k].duration[g].text)
                            distance = data2[i].rows[k].elements[k].duration[g].text
                            distanceReturn = data2[i].rows[k].elements[k].duration[g].text
                        end
                    end
                end
            end
        end
    end
end

timer.performWithDelay(100, SaveData1, 999999)
Your loops are not correct. Try this shorter solution.
Replace your entire "for i = 1, #data2 do" loop with the one below:
print("Working")
for i,row in ipairs(data2.rows) do
for j,element in ipairs(row.elements) do
print(element.duration.text)
end
end
This question was solved on the Corona Forums by Rob Miracle (http://forums.coronalabs.com/topic/47319-parsing-json-from-google-distance-matrix-api/?hl=print_r#entry244400). The solution is simple:
"JSON and Lua tables are almost identical data structures. In this case your table data2 has these top-level entries:
data2.destination_addresses
data2.origin_addresses
data2.rows
data2.status
Now data2.rows is another table that is indexed by numbers (the [] brackets). Here there is only one entry, but it's still an array entry:
data2.rows[1]
Inside of it is another numerically indexed table called elements. So to get to the element (again, there is only one of them):
data2.rows[1].elements[1]
Then it's just a matter of accessing the remaining fields:
data2.rows[1].elements[1].distance.text
data2.rows[1].elements[1].distance.value
data2.rows[1].elements[1].duration.text
data2.rows[1].elements[1].duration.value
There is a great table printing function called print_r, which can be found in the Community Code, that is useful for dumping tables like this to see their structure."