I would like to explode a column into rows in a PySpark DataFrame on Hive.
There are two columns in the dataframe.
The column "business_id" is a string.
The column "sports_info" is a struct type; each element value is an array of strings.
Data:
business_id sports_info
"abc-123" {"sports_type":
["{sport_name:most_recent,
sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"
]
}
I need to get a dataframe like:
business_id sport_id
"abc-123" 568
"abc-123" 171
I defined:
schema = StructType([
    StructField("sports_type", ArrayType(StringType()), True)
])
df = spark.createDataFrame(data=data, schema=schema) # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)
def get_ids(val):
    # parse the JSON-like string stored in sports_type[0] and collect the sport ids
    sport_ids_vals = eval(val.sports_type[0])['sport_events']
    ids = [s['sport_id'] for s in sport_ids_vals]
    return ids

df2 = df.withColumn('sport_new', F.udf(lambda x: get_ids(x),
                                        ArrayType(StringType()))('sports_info'))
How could I create the df and extract/explode the inner nested elements?
df2 = df.withColumn('sport_new', expr("transform(sports_type, x -> regexp_extract(x, 'sport_id:([0-9]+)', 1))"))
df2.show()
Explained:
expr(  # use a SQL expression, the only way to access transform pre Spark 3
"transform(  # run a SQL function on an array
sports_type,  # declare the column to use
x  # declare the name of the variable to use for each element in the array
->  # start writing SQL code to run on each element in the array
regexp_extract(  # use SQL regex functions to pull the value out of the string
x,  # string to run the regex on
'sport_id:([0-9]+)', 1))"  # find sport_id and capture the number following it
)
This will likely run faster than a UDF as it can be vectorized.
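To cover the df-creation and explode parts of the question as well, here is a minimal end-to-end sketch. It is not the original poster's pipeline: it assumes Spark 3.1+ (for regexp_extract_all), the schema and sample row simply mirror the question, and the regex accepts both the "sport_id" and "id" keys that appear in the sample events. On older Spark versions, the single-match regexp_extract shown above can be paired with explode(sports_info.sports_type) instead.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("sports_info", StructType([
        StructField("sports_type", ArrayType(StringType()), True)
    ]), True),
])

data = [(
    "abc-123",
    (["{sport_name:most_recent, sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"],),
)]

df = spark.createDataFrame(data, schema)

# Pull every "...id:<number>" out of each string, flatten the nested arrays,
# then explode the result into one row per extracted id.
df2 = (df
       .withColumn("sport_ids",
                   F.expr("flatten(transform(sports_info.sports_type, "
                          "x -> regexp_extract_all(x, '(?:sport_)?id:([0-9]+)', 1)))"))
       .select("business_id", F.explode("sport_ids").alias("sport_id")))
df2.show()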
I'm retrieving data from the OpenWeatherMap API. In the code below I extract the current weather for more than 500 cities, and I want the log it prints to separate the records into sets of 50 each.
I did it in a non-efficient way that I would really like to improve!
Many many thanks!
x = 1
for index, row in df.iterrows():
    base_url = "http://api.openweathermap.org/data/2.5/weather?"
    units = "imperial"
    query_url = f"{base_url}appid={api_key}&units={units}&q="
    city = row['Name']  # this comes from a df
    response = requests.get(query_url + city).json()
    try:
        df.loc[index, "Max Temp"] = response["main"]["temp_max"]
        if index < 50:
            print(f"Processing Record {index} of Set {x} | {city}")
        elif index < 100:
            x = 2
            print(f"Processing Record {index} of Set {x} | {city}")
        elif index < 150:
            x = 3
            print(f"Processing Record {index} of Set {x} | {city}")
    except (KeyError, IndexError):
        print("City not found. Skipping...")
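For the set bookkeeping itself, a sketch of one cleaner option (assuming the DataFrame keeps its default 0-based integer index, as the prints above suggest) is to derive the set and record numbers arithmetically instead of adding an elif branch per set; the API call and try/except can stay exactly as they are:

set_size = 50
for index, row in df.iterrows():
    set_number = index // set_size + 1      # rows 0-49 -> set 1, rows 50-99 -> set 2, ...
    record_number = index % set_size + 1    # position of this record inside its set
    print(f"Processing Record {record_number} of Set {set_number} | {row['Name']}")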
I have models like below
class Scheduler(models.Model):
    id = <primary key>
    last_run = <foreign key referencing RunLogs.id below>

class RunLogs(models.Model):
    id = <primary key>
    scheduler = <foreign key referencing Scheduler.id above>
    overall_status = <string>
A RunLogs entry is created only when the scheduler reaches the job's scheduled time.
Now I am querying on RunLogs to show running schedules as below.
current = RunLogs.objects\
    .filter(Q(overall_status__in=("RUNNING", "ON-HOLD", "QUEUED")) |
            Q(scheduler__last_run__isnull=True))
The above query gives me all records with a matching status from RunLogs, but does not give me the Scheduler records whose last_run is null.
I understand why the query behaves this way, but is there a way to also get the Scheduler records where last_run is null?
I just did the same steps you followed and found the reason why you were getting all the records after running your query. Here are the exact steps and a solution.
Steps
Created models
from django.db import models

class ResourceLog(models.Model):
    id = models.BigIntegerField(primary_key=True)
    resource_mgmt = models.ForeignKey('ResourceMgmt', on_delete=models.DO_NOTHING,
                                      related_name='cpe_log_resource_mgmt')
    overall_status = models.CharField(max_length=8, blank=True, null=True)

class ResourceMgmt(models.Model):
    id = models.BigIntegerField(primary_key=True)
    last_run = models.ForeignKey(ResourceLog, on_delete=models.DO_NOTHING, blank=True, null=True)
Added the following data:
resource_log
+----+----------------+------------------+
| id | overall_status | resource_mgmt_id |
+----+----------------+------------------+
| 1 | RUNNING | 1 |
| 2 | QUEUED | 1 |
| 3 | QUEUED | 1 |
+----+----------------+------------------+
resource_mgmt
+----+-------------+
| id | last_run_id |
+----+-------------+
| 1 | NULL |
| 2 | NULL |
| 3 | NULL |
| 4 | 3 |
+----+-------------+
According to the above table, resource_mgmt(4) refers to resource_log(3). But note that resource_log(3) does not refer to resource_mgmt(4).
Ran the following commands in the Python shell:
In [1]: resource_log1 = ResourceLog.objects.get(id=1)
In [2]: resource_log1.resource_mgmt
Out[2]: <ResourceMgmt: ResourceMgmt object (1)>
In [3]: resource_log2 = ResourceLog.objects.get(id=2)
In [4]: resource_log2.resource_mgmt
Out[4]: <ResourceMgmt: ResourceMgmt object (1)>
In [5]: resource_log3 = ResourceLog.objects.get(id=3)
In [6]: resource_log3.resource_mgmt
Out[6]: <ResourceMgmt: ResourceMgmt object (1)>
From this we can understand that all the resource_log objects refer to the 1st resource_mgmt object (i.e., id=1).
Q) Why do all the objects refer to the 1st object in resource_mgmt?
resource_mgmt is a non-nullable foreign key field, and its default value here is 1. When you create a resource_log object without specifying resource_mgmt, it falls back to that default, which is 1.
Run your query
In [60]: ResourceLog.objects.filter(resource_mgmt__last_run__isnull = True)
Out[60]: <QuerySet [<ResourceLog: ResourceLog object (1)>, <ResourceLog: ResourceLog object (2)>, <ResourceLog: ResourceLog object (3)>]>
This query returns all three ResourceLog objects because all three refer to the 1st resource_mgmt object, whose last_run is NULL.
Solution
You actually want to check the reverse relationship.
We can achieve this using two queries:
rm_ids = ResourceMgmt.objects.exclude(last_run=None).values_list('last_run', flat=True)
current = ResourceLog.objects.filter(overall_status__in=("RUNNING", "QUEUED")).exclude(id__in=rm_ids)
The output is:
<QuerySet [<ResourceLog: ResourceLog object (1)>, <ResourceLog: ResourceLog object (2)>]>
Hope that helps!
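For reference, the same reverse-relationship check can also be written as a single query by filtering on the reverse side of the last_run foreign key. This is only a sketch: it assumes Django's default reverse lookup name (resourcemgmt), since last_run has no related_name.

# Keep logs with a matching status that no ResourceMgmt row references via last_run.
# 'resourcemgmt' is the default reverse lookup name for the ResourceMgmt.last_run FK;
# adjust it if a related_name is ever added to that field.
current = (ResourceLog.objects
           .filter(overall_status__in=("RUNNING", "QUEUED"))
           .filter(resourcemgmt__isnull=True))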
I'm new to using databases and making django queries to get information.
If I have a table with id as the primary key, and ages and height as other columns, what query would bring me back a dictionary of all the ids and the related ages?
For instance if my table looks like below:
special_id | ages | heights
1 | 5 | x1
2 | 10 | x2
3 | 15 | x3
I'd like to have a key-value pair like {special_id: ages} where special_id is also the primary key.
Is this possible?
Try this:
from django.http import JsonResponse

def get_json(request):
    result = MyModel.objects.all().values('id', 'ages')  # or simply .values() to get all fields
    result_list = list(result)  # important: convert the QuerySet to a list object
    return JsonResponse(result_list, safe=False)
You will get classic:
{field_name: field_value}
And if you want {id_value: ages_value} pairs instead, you can do:
from django.http import JsonResponse

def get_json(request):
    result = MyModel.objects.all()
    a = {}
    for item in result:
        a[item.id] = item.ages
    return JsonResponse(a)
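If all you need is the plain {special_id: ages} dictionary rather than a JSON response, values_list plus dict() gets there in one query. A small sketch, using the hypothetical MyModel and the field names from the snippets above:

# values_list('id', 'ages') yields (id, ages) tuples; dict() turns them into {id: ages}.
id_to_ages = dict(MyModel.objects.values_list('id', 'ages'))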
I'm sure this is a simple SQLContext question, but I can't find any answer in the Spark docs or on Stack Overflow.
I want to create a Spark DataFrame from a SQL query on MySQL.
For example, I have a complicated MySQL query like
SELECT a.X, b.Y, c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...
and I want a DataFrame with columns X, Y and Z.
I figured out how to load entire tables into Spark, and I could load them all and then do the joining and selection there. However, that is very inefficient. I just want to load the table generated by my SQL query.
Here is my current approximation of the code, which doesn't work. The MySQL connector has an option "dbtable" that can be used to load a whole table. I am hoping there is some way to specify a query instead.
val df = sqlContext.format("jdbc").
option("url", "jdbc:mysql://localhost:3306/local_content").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", "root").
option("password", "").
sql(
"""
select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
join DialogLine as dl on dl.DialogID=d.DialogID
join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
join WordRoot as wr on wr.WordRootID=wi.WordRootID
where d.InSite=1 and dl.Active=1
limit 100
"""
).load()
I found this here Bulk data migration through Spark SQL
The dbtable parameter can be any query wrapped in parentheses with an alias. So in my case, I need to do this:
val query = """
(select dl.DialogLineID, dlwim.Sequence, wi.WordRootID from Dialog as d
join DialogLine as dl on dl.DialogID=d.DialogID
join DialogLineWordInstanceMatch as dlwim on dlwim.DialogLineID=dl.DialogLineID
join WordInstance as wi on wi.WordInstanceID=dlwim.WordInstanceID
join WordRoot as wr on wr.WordRootID=wi.WordRootID
where d.InSite=1 and dl.Active=1
limit 100) foo
"""
val df = sqlContext.read.format("jdbc").
option("url", "jdbc:mysql://localhost:3306/local_content").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", "root").
option("password", "").
option("dbtable",query).
load()
As expected, loading each table as its own Dataframe and joining them in Spark was very inefficient.
If you have your table already registered in your SQLContext, you could simply use the sql method.
val resultDF = sqlContext.sql("SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...")
To save the output of a query to a new DataFrame, simply assign the result to a variable:
val newDataFrame = spark.sql("SELECT a.X,b.Y,c.Z FROM FOO as a JOIN BAR as b ON ... JOIN ZOT as c ON ... WHERE ...")
and now newDataFrame is a DataFrame with all the DataFrame functionality available to it.
TL;DR: just create a view in your database.
Detail:
I have a table t_city in my postgres database, on which I create a view:
create view v_city_3500 as
select asciiname, country, population, elevation
from t_city
where elevation > 3500
and population > 100000;

select * from v_city_3500;
asciiname | country | population | elevation
-----------+---------+------------+-----------
Potosi | BO | 141251 | 3967
Oruro | BO | 208684 | 3936
La Paz | BO | 812799 | 3782
Lhasa | CN | 118721 | 3651
Puno | PE | 116552 | 3825
Juliaca | PE | 245675 | 3834
In the spark-shell:
val sx = new org.apache.spark.sql.SQLContext(sc)
val props = new java.util.Properties()
props.setProperty("driver", "org.postgresql.Driver")
val url = "jdbc:postgresql://buya/dmn?user=dmn&password=dmn"
val city_df = sx.read.jdbc(url, "t_city", props)
val city_3500_df = sx.read.jdbc(url, "v_city_3500", props)
Result:
city_df.count()
Long = 145725
city_3500_df.count()
Long = 6
With MySQL, read/load data with something like the below:
val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[2]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://<host>:3306/corbonJDBC?user=user&password=password",
"dbtable" -> "TABLE_NAME")).load()
Write data to a table as below:
import java.util.Properties
val prop = new Properties()
prop.put("user", "<>")
prop.put("password", "simple$123")
val dfWriter = jdbcDF.write.mode("append")
dfWriter.jdbc("jdbc:mysql://<host>:3306/corbonJDBC?user=user&password=password", "tableName", prop)
To create a DataFrame from a query, do something like the below:
val finalModelDataDF = {
  val query = "select * from table_name"
  sqlContext.sql(query)
}
finalModelDataDF.show()