Retrieve data in sets with pandas - JSON

I'm retrieving data from the OpenWeatherMap API. I have the following code, where I'm extracting the current weather for more than 500 cities, and I want the log it prints to separate the data into sets of 50 records each.
I did it in an inefficient way that I would really like to improve!
Many thanks!
import requests

x = 1
for index, row in df.iterrows():
    base_url = "http://api.openweathermap.org/data/2.5/weather?"
    units = "imperial"
    query_url = f"{base_url}appid={api_key}&units={units}&q="
    city = row['Name']  # this comes from a df
    response = requests.get(query_url + city).json()
    try:
        df.loc[index, "Max Temp"] = response["main"]["temp_max"]
        if index < 50:
            print(f"Processing Record {index} of Set {x} | {city}")
        elif index < 100:
            x = 2
            print(f"Processing Record {index} of Set {x} | {city}")
        elif index < 150:
            x = 3
            print(f"Processing Record {index} of Set {x} | {city}")
    except (KeyError, IndexError):
        print("City not found. Skipping...")


Spark RDD filter by MySQL query

I use Spark Streaming to stream data from Kafka, and I want to filter the data based on records in MySQL.
For example, I get data from Kafka like:
{"id":1, "data":"abcdefg"}
and there is data in MySQL like this:
id | state
1 | "success"
I need to query MySQL to get the state of the given id.
I can define a MySQL connection inside the filter function, and it works. The code looks like this:
def isSuccess(x):
    id = x["id"]
    sql = """
        SELECT *
        FROM Test
        WHERE id = "{0}"
    """.format(id)
    conn = mysql_connection(......)
    result = rdbi.query_one(sql)
    if result == None:
        return False
    else:
        return True

successRDD = rdd.filter(isSuccess)
But this defines a connection for every row of the RDD, which wastes a lot of computing resources.
How should I do this in the filter?
I suggest you use mapPartitions, available in Apache Spark, to prevent initializing a MySQL connection for every row of the RDD.
This is the MySQL table that I created:
create table test2(id varchar(10), state varchar(10));
With the following values:
+------+---------+
| id | state |
+------+---------+
| 1 | success |
| 2 | stopped |
+------+---------+
Use the following PySpark Code as reference:
import MySQLdb

data1 = [["1", "afdasds"], ["2", "dfsdfada"], ["3", "dsfdsf"]]  # sample data; in your case, streaming data
rdd = sc.parallelize(data1)

def func1(data1):
    con = MySQLdb.connect(host="127.0.0.1", user="root", passwd="yourpassword", db="yourdb")
    c = con.cursor()
    c.execute("select * from test2;")
    data = c.fetchall()
    dict = {}
    for x in data:
        dict[x[0]] = x[1]
    list1 = []
    for x in data1:
        if x[0] in dict:
            list1.append([x[0], x[1], dict[x[0]]])
        else:
            list1.append([x[0], x[1], "none"])  # assign "none" if the id received from streaming has no match in the table
    return iter(list1)

print(rdd.mapPartitions(func1).filter(lambda x: "none" not in x[2]).collect())
The output I got was:
[['1', 'afdasds', 'success'], ['2', 'dfsdfada', 'stopped']]
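If you would rather keep the original per-id lookup instead of loading the whole table into a dictionary, the same mapPartitions idea still applies: open one connection per partition and reuse it for every row in that partition. A rough sketch, assuming the same test2 table and connection settings as above:

import MySQLdb

def filter_partition(rows):
    # One connection per partition, reused for every row in it.
    con = MySQLdb.connect(host="127.0.0.1", user="root",
                          passwd="yourpassword", db="yourdb")
    c = con.cursor()
    for row in rows:
        c.execute("select state from test2 where id = %s;", (row[0],))
        if c.fetchone() is not None:  # keep rows whose id exists in MySQL
            yield row
    con.close()

successRDD = rdd.mapPartitions(filter_partition)

This trades one bulk query per partition for one lookup per row; which is faster depends on the table size.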

Uniting lists out of the googleway package for Google Directions?

I'm working on looping through longitude and latitude points for the googleway API. I've come up with two ways to do that, in an effort to access the points sections shown in the following link:
https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html
Unfortunately, since this uses a unique key, I can't provide a reproducible example, but below are my attempts: one using mapply and the other a loop. Both work in producing a response in list format; however, I am not sure how to unpack it to pull out the points route as you would when passing only one point:
df$routes$overview_polyline$points
Any suggestions?
library(googleway)
dir_results = mapply(
    myfunction,
    origin = feed$origin,
    destination = feed$destination,
    departure = feed$departure
)
OR
empty_df = NULL
for (i in 1:nrow(feed)) {
    print(i)
    output = google_directions(feed[i, "origin"],
                               feed[i, "destination"],
                               mode = c("driving"),
                               departure_time = feed[i, "departure"],
                               arrival_time = NULL,
                               waypoints = NULL, alternatives = FALSE, avoid = NULL,
                               units = c("metric"), key = chi_directions, simplify = T)
    empty_df = rbind(empty_df, output)
}
EDIT:
The intended output would be a data frame like the one below, where "id" represents the original trip that was fed in.
lat lon id
1 40.71938 -73.99323 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
2 40.71992 -73.99292 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
3 40.71984 -73.99266 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
4 40.71932 -73.99095 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
5 40.71896 -73.98981 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
6 40.71824 -73.98745 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
7 40.71799 -73.98674 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
8 40.71763 -73.98582 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
EDIT:
dput provided for the question about the data frame of origin/destination pairs:
structure(list(origin = c("40.7193908691406 -73.9932174682617",
"40.7641792297363 -73.9734268188477", "40.7507591247559 -73.9739990234375"
), destination = c("40.7096214294434-73.9497909545898", "40.7707366943359-73.9031448364258",
"40.7711143493652-73.9871368408203")), .Names = c("origin", "destination"
), row.names = c(NA, 3L), class = "data.frame")
The SQL code is basic and looks like this:
feed = sqlQuery(con, paste("select top 10
                            longitude as px,
                            latitude as py,
                            dlongitude as dx,
                            dlatitude as dy
                            from mydb"))
And before feeding it in, my data frame feed looks like this (you can ignore departure; I was using that for the Distance API):
origin destination departure
1 40.7439613342285 -73.9958724975586 40.716911315918-74.0121383666992 2017-03-03 01:00:32
2 40.7990493774414 -73.9685516357422 40.8066520690918-73.9610137939453 2017-03-03 01:00:33
3 40.7406234741211 -74.0055618286133 40.7496566772461-73.9834671020508 2017-03-03 01:00:33
4 40.7172813415527 -73.9953765869141 40.7503852844238-73.9811019897461 2017-03-03 01:00:33
5 40.7603607177734 -73.9817123413086 40.7416114807129-73.9795761108398 2017-03-03 01:00:34
As you know, the result of the API query is a list, and if you make multiple calls to the API you'll get back multiple lists.
So to extract the data of interest you have to do standard operations on lists. In this example it can be done with a couple of *applys.
Using the data.frame feed, where each row consists of an origin lat/lon (px/py) and a destination lat/lon (dx/dy):
feed <- data.frame(px = c(40.7193, 40.7641),
                   py = c(-73.993, -73.973),
                   dx = c(40.7096, 40.7707),
                   dy = c(-73.949, -73.903))
You can use an apply to query the google_directions() API for each row of the data.frame, and within the same apply do whatever you want with the result to extract and format it how you like.
lst <- apply(feed, 1, function(x){

    ## query the Google Directions API
    res <- google_directions(key = key,
                             origin = c(x[['px']], x[['py']]),
                             destination = c(x[['dx']], x[['dy']]))

    ## decode the polyline
    df_route <- decode_pl(res$routes$overview_polyline$points)

    ## append the original coordinates as an 'id' column
    df_route[, "id"] <- paste0(paste(x[['px']], x[['py']], sep = "+")
                               , " "
                               , paste(x[['dx']], x[['dy']], sep = "+")
                               , collapse = " ")

    ## store the result of the query, the decoded polyline,
    ## and the original query coordinates in a list
    lst_result <- list(route = df_route,
                       full_result = res,
                       origin = c(x[['px']], x[['py']]),
                       destination = c(x[['dx']], x[['dy']]))
    return(lst_result)
})
So now lst is a list that contains the result of each query, plus the decoded polyline as a data.frame. To get all the decoded polylines into a single data.frame, you can do another lapply and then rbind it all together:
## do what we want with the result, for example bind all the route coordinates into one data.frame
df <- do.call(rbind, lapply(lst, function(x) x[['route']]))
head(df)
lat lon id
1 40.71938 -73.99323 40.7193+-73.993 40.7096+-73.949
2 40.71992 -73.99292 40.7193+-73.993 40.7096+-73.949
3 40.71984 -73.99266 40.7193+-73.993 40.7096+-73.949
4 40.71932 -73.99095 40.7193+-73.993 40.7096+-73.949
5 40.71896 -73.98981 40.7193+-73.993 40.7096+-73.949
6 40.71824 -73.98745 40.7193+-73.993 40.7096+-73.949

Django query to return primary-key-related column values

I'm new to using databases and making Django queries to get information.
If I have a table with id as the primary key, and ages and height as other columns, what query would bring me back a dictionary of all the ids and the related ages?
For instance if my table looks like below:
special_id | ages | heights
1 | 5 | x1
2 | 10 | x2
3 | 15 | x3
I'd like to have a key-value pair like {special_id: ages} where special_id is also the primary key.
Is this possible?
Try this:
from django.http import JsonResponse

def get_json(request):
    result = MyModel.objects.all().values('id', 'ages')  # or simply .values() to get all fields
    result_list = list(result)  # important: convert the QuerySet to a list object
    return JsonResponse(result_list, safe=False)
You will get the classic:
{field_name: field_value}
And if you want {id_value: age_value} instead, you can do:
from django.http import JsonResponse

def get_json(request):
    result = MyModel.objects.all()
    a = {}
    for item in result:
        a[item.id] = item.age
    return JsonResponse(a)
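If all you need is the {special_id: ages} dictionary itself (rather than a JSON response), values_list() plus dict() is more direct. A small sketch, assuming the model is named MyModel with fields id and ages:

# values_list('id', 'ages') yields (id, ages) tuples; dict() pairs them up.
id_to_age = dict(MyModel.objects.values_list('id', 'ages'))
# e.g. {1: 5, 2: 10, 3: 15}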

Function on each row of pandas DataFrame but not generating a new column

I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce some constraints for an optimization problem using the information in each row of this data frame, so I don't want to generate an output column and add it to the data frame. The way that I have done it is as below:
def Computation(row):
    App = pd.Series(row['A'])
    App = App.tolist()
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT, CS, DS, App))
    return m.addConstr(quicksum(y[r, c, d, a] for r, c, d, a in File3) == 1)
But it does not work when called with:
df.apply(Computation, axis=1)
Could you please let me know if there is any way to do this?
.apply will attempt to convert the value returned by the function to a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows:
# In pseudocode:
for index, row in df.iterrows():
    constrained = Computation(row)
Also, your Computation can be expressed as:
def Computation(row):
    App = list(row['A'])  # works as long as row['A'] is iterable
    # For the next 3 lines, see the note below.
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT, CS, DS, App))
    return m.addConstr(quicksum(y[r, c, d, a] for r, c, d, a in File3) == 1)
Note: [<list>] * n creates n references to the same <list>, not n independent lists. Changes made through one reference will show up in all n of them. If that is not what you want, build the lists independently. See this question and its answers for details, specifically this answer.
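A quick demonstration of that pitfall, in case it is unfamiliar:

inner = [0]
aliased = [inner] * 3                   # three references to the same list
aliased[0].append(1)
print(aliased)                          # [[0, 1], [0, 1], [0, 1]] -- all "copies" changed
independent = [[0] for _ in range(3)]   # three separate lists
independent[0].append(1)
print(independent)                      # [[0, 1], [0], [0]]

In the Computation function above this is harmless, because row['B'] and friends are scalars, but it bites as soon as the repeated element is mutable.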

Formatting data in a CSV file (calculating average) in Python

import csv

with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print(parts[4])

f = open('Class1scores.csv')
csv_f = csv.reader(f)

newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3]) / 3)
    row.append(average)
    newlist.append(row[0:4])

averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
Here the code is meant to read the CSV file; for each row (column 0 being the user's name) it should add the three scores and divide by three, but it doesn't calculate a proper average; it just takes the score from the last column.
Basically you want statistics for each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        scores = [int(s) for s in row[1:]]  # convert score strings to integers
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX': max(scores),
            'MIN': min(scores),
            'AVE': 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether doing a specific thing is locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row won't change anything, since row is a local variable in the for loop and will be garbage collected.
row[1:3] only gives the second and the third elements. row[1:4] gives what you want, as does row[1:]. Slicing in Python is normally end-exclusive.
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get the work done as soon as possible with the least effort? Can I get this task done in 30 seconds?
Here is one way to do it. See both parts. First, we create a dictionary with names as the keys and a list of results as the values.
import csv

fileLineList = []
averageScoreDict = {}

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter

fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning John a list of 3 integers:
    # highest, lowest and average (which is a float we round).
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

averageScoreList = []

# Here we "unpack" the dictionary we have created and build a list of keys,
# which are the names, plus the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])

# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)

print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]
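For comparison, a more compact sketch of the same computation using a list comprehension and an in-place sort, assuming each CSV row is a name followed by integer scores:

import csv

with open('Class1scores.csv', newline='') as f:
    averages = [(row[0], round(sum(map(int, row[1:])) / len(row[1:])))
                for row in csv.reader(f) if len(row) > 1]

# Sort by average, highest first.
averages.sort(key=lambda pair: pair[1], reverse=True)
print(averages)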