issue with connecting data in databricks from data lake and reading JSON into Folium - json

I'm working on something based on this blog post:
https://python-visualization.github.io/folium/quickstart.html#Getting-Started
specifically part 13, which uses Choropleth maps.
The piece of code they use is the following:
import folium
import pandas as pd

url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

folium.LayerControl().add_to(m)
m
If I use this, I get the requested map.
Now I try to do this with my own data. I work in Databricks,
so I have a JSON file with the GeoJSON data (source_file1) and a CSV file (source_file2) with the data that needs to be plotted on the map.
source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
state_geo = spark.read.json(source_file1, multiLine=True)

source_file2 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/DATASVZ.csv"
df_2 = (
    spark.read.format("CSV")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", ";")
    .load(source_file2)
)
state_data = df_2.toPandas()
When I adjust the code accordingly:
m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)
m
So when I pass the geo_data parameter as a Spark DataFrame, I get the following error:
ValueError: Cannot render objects with any missing geometries: DataFrame[features: array<struct<geometry:struct<coordinates:array<array<array<string>>>,type:string>,properties:struct<arr_fr:string,arr_nis:bigint,arr_nl:string,fill:string,fill-opacity:double,name_fr:string,name_nl:string,nis:bigint,population:bigint,prov_fr:string,prov_nis:bigint,prov_nl:string,reg_fr:string,reg_nis:string,reg_nl:string,stroke:string,stroke-opacity:bigint,stroke-width:bigint>,type:string>>, type: string]
I think that when the data is transformed from the "blob format" in the Azure data lake to a Spark DataFrame, something goes wrong with the format. I tested this in a Jupyter notebook on my desktop, reading the data straight from the file into folium, and it all works, roughly as sketched below.
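For reference, the local Jupyter test looks roughly like this (a minimal sketch; the local file path is assumed). As far as I can tell, folium.Choropleth accepts a dict, a file path, or a URL for geo_data, but not a Spark DataFrame:

import json
import folium

# Local desktop test: read the GeoJSON into a plain Python dict and hand it to folium.
with open("Belgie_GEOJSON.JSON") as f:   # local copy of the file, path assumed
    state_geo = json.load(f)

m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=state_geo,           # a plain dict works here; a Spark DataFrame does not
    data=state_data,              # the pandas DataFrame with the values to plot
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
).add_to(m)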
If I load it directly from the source, like the example does with its web URL, by adjusting the 'geo_data' parameter of the folium function:
m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=source_file1,  # this points directly to the data lake
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)
m
I then get an error: the function expects a local file path, and it fails because the path is prefixed with "dbfs:".
So I started wondering what the difference is between my JSON file and the one from the blog post. The only thing I can imagine is that the Azure data lake doesn't store my JSON as a JSON file but as a block blob, and for some reason I am not converting it properly so that folium can read it.
Azure blob storage (data lake)
So can someone with folium knowledge let me know:
A. Is it possible to load the geo_data directly from a data lake?
B. In what format do I need to upload the data?
Any thoughts on this would be helpful!
Thanks in advance!

Solved this issue: I just had to replace "dbfs:" with "/dbfs". I had tried it a lot of times, but used "/dbfs:" and got another error.
Can't believe I'm this stupid :-)
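For completeness, a minimal sketch of the working call with the local /dbfs mount path (path taken from above; the other Choropleth arguments are unchanged):

import folium

source_file1 = "/dbfs/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"  # note /dbfs, not dbfs:

m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=source_file1,              # folium opens this as a local file path
    name="choropleth",
    data=state_data,                    # the pandas DataFrame from df_2.toPandas() above
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m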

Related

unable to load csv from GCS bucket to BigQuery table accurately

I am trying to load the airbnb_nyc dataset from a GCS bucket to a BigQuery table. Link to the dataset.
I am using the following code:
def parse_file(element):
    for line in csv.reader([element], delimiter=','):
        return line

class DataIngestion2:
    def parse_method2(self, values):
        row1 = dict(
            zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365'),
                values))
        return row1

with beam.Pipeline(options=pipeline_options) as p:
    lines = (p | 'Read' >> ReadFromText(known_args.input, skip_header_lines=1)
               | 'parse' >> beam.Map(parse_file))

    pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
    pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(table_spec, schema=table_schema,
                                                   write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                                   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
But my output in the BigQuery table is wrong.
I am only getting values for the first two columns, and the remaining 14 columns show NULL. I am not able to figure out what I am doing wrong. Can someone help me find the error in my logic? I basically want to know how to transfer a CSV from a GCS bucket to BigQuery through a Dataflow pipeline.
Thank you,
You can use the ReadFromText method and then create your own transform by extending beam.DoFn. I have attached the code below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about ParDo and DoFn:
https://beam.apache.org/documentation/programming-guide/#pardo
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv

COLUMN_NAMES = ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                'calculated_host_listings_count', 'availability_365']

def files(path='gs:/some/path'):
    # list all objects under the prefix via GcsIO
    return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())

def transform_csv(element):
    # parse one file into a list of rows, skipping the header row
    rows = []
    with open(element, newline='\r\n') as f:
        itr = csv.reader(f, delimiter=',', quotechar='"')
        skip_head = next(itr)
        for row in itr:
            rows.append(row)
    return rows

def to_dict(element):
    # turn each row into a dict keyed by the column names
    rows = []
    for item in element:
        row_dict = {}
        zipped = zip(COLUMN_NAMES, item)
        for key, val in zipped:
            row_dict[key] = val
        rows.append(row_dict)
    yield rows

with beam.Pipeline() as p:
    read = (
        p
        | 'read-file' >> beam.Create(files())
        | 'transform-dict' >> beam.Map(transform_csv)
        | 'list-to-dict' >> beam.FlatMap(to_dict)
        | 'print' >> beam.Map(print)
        # | 'write-to-bq' >> WriteToBigQuery(schema=COLUMN_NAMES, table='ur table', project='', dataset='')
    )
EDIT 1: ReadFromText supports \r\n as the newline character, but this fails to handle the case where the column data itself contains \r\n. Updating the code below.
EDIT 2: GcsIO error fixed.
Note - I have used GcsIO to get the list of files.
Details here
Please up-vote and mark as answer if this helps.
Let me suggest another approach for this use case. BigQuery offers a special feature for loading data from Google Cloud Storage (GCS) into BigQuery. You can load data in several formats, and CSV is among them.
There is a nice tutorial in the Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam; the process is available through the BigQuery API itself.
It works in many languages, but you do not have to use any language at all, as the process can be done from the console or via the Cloud SDK using the bq command. Everything can be found in the mentioned tutorial. For illustration, a minimal sketch with the Python client library follows.
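A minimal sketch of such a load job using the google-cloud-bigquery Python client (the bucket, dataset, and table names below are placeholders, not taken from the question):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

# Placeholder URI and table id - replace with your own bucket/dataset/table.
load_job = client.load_table_from_uri(
    "gs://your-bucket/airbnb_nyc.csv",
    "your-project.your_dataset.airbnb_nyc",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete

print(client.get_table("your-project.your_dataset.airbnb_nyc").num_rows, "rows loaded")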

JSONDecodeError: Expecting value: line 1 column 1 (char 0) while getting data from Pokemon API

I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd

def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''

    # GATHERING THE DATA FROM THE API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y+1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent = 4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])

    # MAKING A DATAFRAME FROM THE GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
    return df
The code works fine for most pokemon. However, when I try to run poke_scrape(229, 229) (so trying to load ONLY the 229th pokemon), it gives me the JSONDecodeError from the title.
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue has occurred with another ID - otherwise I could just manually enter the stats for the specific pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for a pokemon only load when the URL ends with a '/' (such as https://pokeapi.co/api/v2/pokemon/229/ vs https://pokeapi.co/api/v2/pokemon/229 - the first link will work and the second will return "not found"). However, other pokemon respond with an error because of the added '/', so I fixed the issue with a few if statements right after the start of the for loop at the beginning of the function. A rough sketch of that idea follows.
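A minimal sketch of that retry logic (a hypothetical helper, not my exact code - it simply tries the URL again with a trailing slash if the first response does not come back as valid JSON):

import requests

def fetch_pokemon(id_):
    # try without the trailing slash first, then retry with it
    base = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
    resp = requests.get(base)
    if resp.status_code != 200 or not resp.text.strip():
        resp = requests.get(base + '/')
    resp.raise_for_status()
    return resp.json()

pages = fetch_pokemon(229)
print(pages['name'])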

How to get data from the Australian Bureau of Statistics using pandasdmx

Has anyone managed to get ABS data using the pandasdmx library?
Here is the code to get data from the European Central Bank (ECB), which is working.
from pandasdmx import Request
ecb = Request('ECB')
flow_response = ecb.dataflow()
print(flow_response.write().dataflow.head())
exr_flow = ecb.dataflow('EXR')
dsd = exr_flow.dataflow.EXR.structure()
data_response = ecb.data(resource_id='EXR', key={'CURRENCY': ['USD', 'JPY']}, params={'startPeriod': '2016'})
However, when I change Request('ECB') to Request('ABS'), an error pops up in the 2nd line saying:
"{ValueError} This agency only supports requests for data, not dataflow."
Is there a way to get data from ABS?
Documentation for pandasdmx: https://pandasdmx.readthedocs.io/en/stable/usage.html#basic-usage
Hope this will help
from pandasdmx import Request
Agency_Code = 'ABS'
Dataset_Id = 'ATSI_BIRTHS_SUMM'
ABS = Request(Agency_Code)
data_response = ABS.data(resource_id='ATSI_BIRTHS_SUMM', params={'startPeriod': '2016'})
#This will result into a stacked DataFrame
df = data_response.write(data_response.data.series, parse_time=False)
#A flat DataFrame
data_response.write().unstack().reset_index()
The Australian Bureau of Statistics (ABS) only supports SDMX-JSON APIs; it doesn't send SDMX-ML messages like the other providers. That's the reason it doesn't support the dataflow feature.
Please read this for further reference: https://pandasdmx.readthedocs.io/en/stable/agencies.html#pre-configured-data-providers

Python3: Replacing special characters in a .csv file after converting it from JSON

I am trying to develop a program using Python 3.6.4 which converts a JSON file into a CSV file, and I also need to clean the data in the CSV file. For example:
My JSON file:
{emp:[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},
{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}
Problem 1:
After converting that into CSV, a blank row is created between every 2 rows, like this:
**Name email Des**
[<BLANK ROW>]
Bo#b bob#gmail.com Unknown
[<BLANK ROW>]
Martin mar#tin#gmail.com D#eveloper
Problem 2:
In my code I am using emp, but I need to determine it dynamically.
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read()
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
emp_data = employee_parsed['employee']
We will not know the structure or content of the incoming JSON files in advance.
Problem 3:
I also need to remove all # characters from the CSV file.
To solve problem 3, you can use .replace (https://www.tutorialspoint.com/python/string_replace.htm).
For problem 2, you can use the dictionary keys and then take the first item.
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read().replace("#", "")
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
# dict.keys() is not subscriptable in Python 3, so wrap it in list() first
first_key = list(employee_parsed.keys())[0]
emp_data = employee_parsed[first_key]
I can't solve problem 1 without more code showing how you are exporting the result. It may be that your data has newlines in it, in which case you could add .replace("\n", "") and/or .replace("\r", "") after the previous replace, so the line would read fobj.read().replace("#", "").replace("\n", "").replace("\r", ""). Another common cause of the blank rows is sketched below.
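A frequent cause of the blank-row symptom on Windows is opening the output file for csv.writer without newline=''; whether that is the cause here depends on the export code we have not seen. A hedged sketch of writing the cleaned records (the input path and the dynamic-key handling are from the question; the output file name is a placeholder):

import csv
import json

with open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt") as fobj:
    employee_parsed = json.loads(fobj.read().replace("#", ""))

first_key = list(employee_parsed.keys())[0]   # found dynamically, e.g. 'emp'
emp_data = employee_parsed[first_key]

# newline='' prevents csv.writer from emitting an extra blank row on Windows
with open("employees.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(emp_data[0].keys()))
    writer.writeheader()
    writer.writerows(emp_data)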

How do I query a nested json after loading it with elephant bird

I'm pretty new to Hadoop and Pig.
So, I have single-line JSON files which all have the same schema:
{"name":"someName","pkg":[{"F1":"abc","F2":"44","F3":"xyz","F4":1024,"info":
[{"timestamp":1372631550000,"value":"122","id":"nnn","name":"ppp"},
{"timestamp":1372649240000,"value":"222","id":"ggg","name":"qqq"}]} ,
{"F1":"abc","f2":"44","F3":"xyz","F4":1024,"new":[{"type":"event1", "time":1372537000000,"more":"
{\"bbad\":\"HELLO\",\"is_done\":0,\"ssss\":-128}"}]}]}
I load all of the JSON files using elephant bird:
data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
So far the only thing that works for me is querying the "name" field, which returns a bytearray:
b = foreach data generate json#'name' as name
I then tried to convert it to a map instead:
c = FOREACH data GENERATE json#'name' as (m:map[]);
DESCRIBE c;
and get
c: {tuple_0: (m:map[])}
and the data looks like:
({([F1#"abc",F2#44...])})
So now I need to filter all records that have pkg.F1 = "abc", or all records that have pkg.info.value = 122, etc.
How do I do it?
A code example would be very helpful, as I have already googled this a lot.
Thanks
Try this
c = FOREACH data GENERATE flatten(json#'name') as (m:map[]);
The problem is that you don't know how your data is organized in Pig. Use
DESCRIBE data;
to find out what the structure returned by JsonLoader is, and this should give you enough information about how to extract your data.