Sorting with csv library, error says my dates don't match '%Y-%m-%d' format when they do - csv

I'm trying to sort a CSV by date first, then time second. With Pandas it was easy using df = df.sort_values(by=['Date', 'Time_UTC']). With the csv library, the code is:
with open('eqph_csv_29May2020_noF_5lines.csv') as file:
    reader = csv.DictReader(file, delimiter=',')
    date_sorted = sorted(reader, key=lambda Date: datetime.strptime('Date', '%Y-%m-%d'))
    print(date_sorted)
The datetime documentation clearly says these format codes are right. Here's a sample CSV (shown here without commas):
Date Time_UTC Latitude Longitude
2020-05-28 05:17:31 16.63 120.43
2020-05-23 02:10:27 15.55 121.72
2020-05-20 12:45:07 5.27 126.11
2020-05-09 19:18:12 14.04 120.55
2020-04-10 18:45:49 5.65 126.54

csv.DictReader returns an iterator that yields a dict for each row in the csv file. The lambda above never looks inside those dicts: it passes the literal string 'Date' to strptime, which is why the error says the data doesn't match the format. To sort on a column, pull that column out of each row in the key function:
date_sorted = sorted(reader, key=lambda row: datetime.strptime(row['Date'], '%Y-%m-%d'))
To sort on both Date and Time_UTC, you could combine them into one string and convert that to a datetime:
date_sorted = sorted(reader, key=lambda row: datetime.strptime(row['Date'] + ' ' + row['Time_UTC'], '%Y-%m-%d %H:%M:%S'))
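For completeness, a minimal end-to-end sketch of the same approach that also writes the sorted rows back out (the output filename sorted.csv is made up):
import csv
from datetime import datetime

with open('eqph_csv_29May2020_noF_5lines.csv') as file:
    reader = csv.DictReader(file, delimiter=',')
    # sort on the combined date and time, parsed as a datetime
    date_sorted = sorted(
        reader,
        key=lambda row: datetime.strptime(
            row['Date'] + ' ' + row['Time_UTC'], '%Y-%m-%d %H:%M:%S'))

with open('sorted.csv', 'w', newline='') as out:  # example output name
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(date_sorted)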

Nick's answer worked, and I used it to revise mine. I used csv.reader() instead.
lat, lon = [], []
xy = zip(lat, lon)  # zip() is lazy, so it will see the values appended below
with open('eqph_csv_29May2020_noF_20lines.csv') as file:
    reader = csv.reader(file, delimiter=',')
    next(reader)  # skip the header row
    date_sorted = sorted(
        reader,
        key=lambda row: datetime.strptime(
            row[0] + ' ' + row[1], '%Y-%m-%d %H:%M:%S'))
for row in date_sorted:
    lat.append(float(row[2]))  # column 2 is Latitude
    lon.append(float(row[3]))  # column 3 is Longitude
for i in xy:
    print(i)
Result
(6.14, 126.2)
(14.09, 121.36)
(13.74, 120.9)
(6.65, 125.42)
(6.61, 125.26)
(5.49, 126.57)
(5.65, 125.61)
(11.33, 124.64)
(11.49, 124.42)
(15.0, 119.79) # 2020-03-19 06:33:00
(14.94, 120.17) # 2020-03-19 06:49:00
(6.7, 125.18)
(5.76, 125.14)
(9.22, 124.01)
(20.45, 122.12)
(5.65, 126.54)
(14.04, 120.55)
(5.27, 126.11)
(15.55, 121.72)
(16.63, 120.43)
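As an aside, because Date and Time_UTC are zero-padded ISO-8601 strings, they already sort chronologically as plain strings, so the strptime call can be skipped entirely; a minimal sketch using a tuple key:
import csv

with open('eqph_csv_29May2020_noF_20lines.csv') as file:
    reader = csv.reader(file, delimiter=',')
    next(reader)  # skip the header
    # '2020-04-10' < '2020-05-09' holds for strings, so no parsing is needed
    date_sorted = sorted(reader, key=lambda row: (row[0], row[1]))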

Related

For loop with different number of iterations based on datetime

I am trying to get hourly data from a JSON file for a 34-month period. To do this I have created a daterange which I use in a nested loop to get data for each day for all 24 hours. This works fine.
However, because of daylight savings, there are only 23 daily observations on 3 occasions, the first being 2020-03-29. And therefore, I would like to loop only 23 iterations on this date since my loop crashes otherwise.
Below is my code. Right now it gets stuck on the date check with SyntaxError: invalid syntax, and there is a high risk it will get stuck on something else once that is fixed.
Thank you.
start_date = date(2020, 1, 1)
end_date = date(2022, 11, 1)

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

parsing_range_svk = []
for single_date in daterange(start_date, end_date):
    single = single_date.strftime("%Y-%m-%d")
    parsing_range_svk.append(single)
######################################
svk = []
for i in parsing_range_svk:
    data_json_svk = json.loads(urlopen("https://www.svk.se/services/controlroom/v2/situation?date={}&biddingArea=SE1".format(i)).read())
    if i == '2020-03-29'  # missing colon here: the reported SyntaxError
        for i in range(23):
            rows = data_json_svk['Data'][0]['data'][i]['y']
    else:
        for i in range(24):
            rows = data_json_svk['Data'][0]['data'][i]['y']
    svk.append(rows)
Don't check explicitly for the date; instead use a list comprehension to take however many values a day actually has (it works correctly for both 23- and 24-hour days):
import json
from urllib.request import urlopen
from datetime import date, timedelta

start_date = date(2020, 1, 1)
end_date = date(2022, 11, 1)

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

parsing_range_svk = []
for single_date in daterange(start_date, end_date):
    single = single_date.strftime("%Y-%m-%d")
    parsing_range_svk.append(single)
######################################
url = "https://www.svk.se/services/controlroom/v2/situation?date={}&biddingArea=SE1"
svk = []
for i in parsing_range_svk:
    data_json_svk = json.loads(urlopen(url.format(i)).read())
    # take however many hourly values the day actually has
    svk.append([v["y"] for v in data_json_svk["Data"][0]["data"]])
print(svk)
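As a side note, the 23-observation days are exactly the spring DST transitions; a small sketch (assuming Python 3.9+ for zoneinfo, and the Europe/Stockholm zone since this is Swedish grid data) that counts the hours in a civil day:
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

def hours_in_day(d, tz=ZoneInfo("Europe/Stockholm")):
    # midnight at the start of d and of the following day, timezone-aware
    start = datetime(d.year, d.month, d.day, tzinfo=tz)
    nxt = d + timedelta(days=1)
    end = datetime(nxt.year, nxt.month, nxt.day, tzinfo=tz)
    # aware-datetime subtraction happens in UTC, so DST gaps/overlaps show up
    return int((end - start) / timedelta(hours=1))

print(hours_in_day(date(2020, 3, 29)))   # 23 (spring forward)
print(hours_in_day(date(2020, 10, 25)))  # 25 (fall back)
print(hours_in_day(date(2020, 6, 1)))    # 24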

Join and plot data with different times in 10 minute interval

I have 3 tables in an Access database with the same column names (TempDate and Temp), but different time stamps. The data was collected in 10 minute intervals, but each of the recording devices had different start times. I want to merge these into one table with a single TempDate and one Temp column for each of the tables (temp1, temp2, temp3).
I need help on how to do this in either Access or R. I've started using R with MySQL code, but I'm still very new at it. Thanks in advance. Ultimately I want to join this data to another data frame with a datetime stamp from the same period of dates; I think I can manage that if someone can show me how to group by an interval, and then finally plot using ggplot.
Data
temp1<-data.frame(TempDate=c("2020/08/11 07:13:01","2020/08/11 07:23:01","2020/08/11 07:33:01","2020/08/11 07:43:01"),Temperature=c(1.610,-1.905,-1.905,-0.901))
temp2<-data.frame(TempDate=c("2020/08/11 07:10:01","2020/08/11 07:20:01","2020/08/11 07:30:01","2020/08/11 07:40:01"),Temperature=c(15.641,15.641,15.641,15.641))
temp3<-data.frame(TempDate=c("2020/08/11 07:19:01","2020/08/11 07:29:01","2020/08/11 07:39:01","2020/08/11 07:49:01"),Temperature=c(2.062,3.573,4.076,4.579))
> temp3 #as example
TempDate Temperature
1 2020/08/11 07:19:01 2.062
2 2020/08/11 07:29:01 3.573
3 2020/08/11 07:39:01 4.076
4 2020/08/11 07:49:01 4.579
# what I want: row 1 holds the temps recorded from 07:10:00-07:19:59, etc.
             TempDate  Temp1  Temp2 Temp3
1 2020/08/11 07:10:00  1.610 15.641 2.062
2 2020/08/11 07:20:00 -1.905 15.641 3.573
3 2020/08/11 07:30:00 -1.905 15.641 4.076
4 2020/08/11 07:40:00 -0.901 15.641 4.579
UPDATE:
Thanks to Ben for the great answer that got me started on this problem. When I asked another question, floor_date was suggested, and it worked better for my data than the cut function in @Ben's answer: with cut I would get times ending in 9 (12:19) instead of 0 (12:10). I also tried TempDate+60 inside the cut function, but then some dates would get a time in the next 10-minute interval. The code below was more accurate.
library(dplyr)
library(lubridate)

tempdata <- bind_rows(burrow = burrow, shade = shade, sun = sun, .id = 'Series') %>%
  mutate(TempDate = as.POSIXct(TempDate, tz = "UTC"),
         TimeStamp = floor_date(TempDate, unit = '10 mins'),
         TimeStamp = as.POSIXct(TimeStamp, tz = "UTC")) %>%
  filter(TimeStamp > as.POSIXct("2020-08-12 13:29:00", tz = "UTC")) %>%
  select(Series, Temperature, TimeStamp) %>%
  arrange(TimeStamp)
In R you could do the following, using a tidyverse approach.
First, you can use bind_rows to stack your data frames, adding a source column that holds the name of the data frame each temperature came from; these names become the column names in the final result.
Then, make sure TempDate is POSIXct. You can use cut to bin the datetimes into 10 minute intervals.
At this point, I would consider leaving the result as is for plotting with ggplot2. It's often preferable to leave in "long" format instead of "wide". However, if you want it in "wide" format, then you can use pivot_wider from tidyr.
library(dplyr)
library(tidyr)
bind_rows(temp1 = temp1, temp2 = temp2, temp3 = temp3, .id = 'source') %>%
  mutate(TempDate = as.POSIXct(TempDate),
         NewTempDate = cut(TempDate, breaks = "10 min")) %>%
  pivot_wider(id_cols = NewTempDate, names_from = source, values_from = Temperature)
Output
NewTempDate temp1 temp2 temp3
<fct> <dbl> <dbl> <dbl>
1 2020-08-11 07:10:00 1.61 15.6 2.06
2 2020-08-11 07:20:00 -1.90 15.6 3.57
3 2020-08-11 07:30:00 -1.90 15.6 4.08
4 2020-08-11 07:40:00 -0.901 15.6 4.58
In Access (VBA), you can round the times down like this:
texttime = "2020/08/11 07:19:01"
truetime = DateValue(texttime) + TimeSerial(Hour(CDate(texttime)), (Minute(CDate(texttime)) \ 10) * 10, 0)
' Result:
' 2020-08-11 07:10:00
However, how to implement this in R, I don't know.
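For comparison, the same floor-to-10-minutes-and-pivot idea can be sketched in Python with pandas (the frames below mirror the question's temp1 and temp2 samples; everything else is assumption):
import pandas as pd

temp1 = pd.DataFrame({
    "TempDate": ["2020/08/11 07:13:01", "2020/08/11 07:23:01"],
    "Temperature": [1.610, -1.905],
})
temp2 = pd.DataFrame({
    "TempDate": ["2020/08/11 07:10:01", "2020/08/11 07:20:01"],
    "Temperature": [15.641, 15.641],
})

long = pd.concat({"temp1": temp1, "temp2": temp2}, names=["source"]).reset_index(level=0)
long["TempDate"] = pd.to_datetime(long["TempDate"])
# dt.floor("10min") rounds each timestamp down to its 10-minute bin,
# like lubridate::floor_date(..., unit = '10 mins')
long["Bin"] = long["TempDate"].dt.floor("10min")
wide = long.pivot(index="Bin", columns="source", values="Temperature")
print(wide)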

Convert Json Date with 12 digits Python Datetime

I am receiving a JSON which I convert to a DataFrame df. One of the columns contains dates in this format:
/Date(950842800000)/, /Date(1000436400000)/, ...
The problem is that some of these dates have 12 digits and the others 13. The ones with 13 digits are converted fine, but with 12 there is a problem. The way I am converting,
df["Data"] = df["Date"].apply(lambda x: datetime.fromtimestamp(int(x[6:-2][:10])) if len(x) > 12 else datetime.fromtimestamp(int(x[6:-2][:11])))
doesn't work for 12 digits.
Thank You for help.
Short:
lambda x: datetime.fromtimestamp(int(x[6:-2][:-3]))
Long:
If you have such input data:
"/Date(950842800000)/",
"/Date(1000436400000)/"
Then a few modifications will make the script work correctly:
from datetime import datetime
dates = [
    "/Date(950842800000)/",
    "/Date(1000436400000)/"
]
for d in dates:
    l = lambda x: (datetime.fromtimestamp(int(x[6:-2][:9])) if len(x) < 21
                   else datetime.fromtimestamp(int(x[6:-2][:10])))
    print(l(d))
produces:
2000-02-18 04:00:00
2001-09-14 05:00:00
which is what you expect.
But going for simplicity, you may just use:
from datetime import datetime
dates = [
    "/Date(950842800000)/",
    "/Date(1000436400000)/"
]
for d in dates:
    l = lambda x: datetime.fromtimestamp(int(x[6:-2][:-3]))
    print(l(d))
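Alternatively, since these are millisecond timestamps, integer-dividing by 1000 sidesteps the string-length arithmetic entirely and handles 12 and 13 digits alike; a sketch on the same sample data:
from datetime import datetime

dates = ["/Date(950842800000)/", "/Date(1000436400000)/"]
for d in dates:
    millis = int(d[6:-2])  # the digits between "/Date(" and ")/"
    print(datetime.fromtimestamp(millis // 1000))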

Scala Spark - For loop in Data Frame and compare date

I have a Data Frame which has 3 columns like this:
---------------------------------------------
| x(string) | date(date) | value(int) |
---------------------------------------------
I want to SELECT all the rows [i] that satisfy all 4 conditions:
1) row [i] and row [i - 1] have the same value in column 'x'
AND
2) 'date' at row [i] == 'date' at row [i - 1] + 1 (two consecutive days)
AND
3) 'value' at row [i] > 5
AND
4) 'value' at row [i - 1] <= 5
I think maybe I need a for loop, but I don't know exactly how. Please help me!
Any help is much appreciated!
It can be done very easily with window functions; look at the lag function:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
// test data
val list = Seq(
  ("x", "2016-12-13", 1),
  ("x", "2016-12-14", 7)
)
val df = sc.parallelize(list).toDF("x", "date", "value")

// add lags - i.e. read the previous values from the dataset
val withPrevs = df
  .withColumn("prevX", lag('x, 1).over(Window.orderBy($"date")))
  .withColumn("prevDate", lag('date, 1).over(Window.orderBy($"date")))
  .withColumn("prevValue", lag('value, 1).over(Window.orderBy($"date")))

// filter values and select only the needed fields
withPrevs
  .where('x === 'prevX)
  .where('value > lit(5))
  .where('prevValue <= lit(5)) // condition 4 is <= 5
  .where('date === date_add('prevDate, 1))
  .select('x, 'date, 'value)
  .show()
Note that without an ordering, i.e. by date, this cannot be done. A Dataset has no meaningful built-in order; you must specify the order explicitly.
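For anyone using the Python API, a PySpark sketch of the same lag-based approach (it assumes an existing SparkSession named spark; a real job would also partition the window, e.g. by x):
from pyspark.sql import Window
from pyspark.sql.functions import col, date_add, lag, to_date

df = spark.createDataFrame(
    [("x", "2016-12-13", 1), ("x", "2016-12-14", 7)],
    ["x", "date", "value"],
).withColumn("date", to_date(col("date")))

w = Window.orderBy("date")
result = (
    df.withColumn("prevX", lag("x", 1).over(w))
      .withColumn("prevDate", lag("date", 1).over(w))
      .withColumn("prevValue", lag("value", 1).over(w))
      .where(col("x") == col("prevX"))                     # condition 1
      .where(col("date") == date_add(col("prevDate"), 1))  # condition 2
      .where(col("value") > 5)                             # condition 3
      .where(col("prevValue") <= 5)                        # condition 4
      .select("x", "date", "value")
)
result.show()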
If you have a DataFrame created, then all you need to do is call the filter function on the DataFrame with all your conditions.
For example:
df1.filter($"Column1" === 2 || $"Column2" === 3)
You can pass as many conditions as you want. It will return you a new DataFrame with filtered data.

Use R or mysql to calculate time period returns?

I'm trying to calculate various time period returns (monthly, quarterly, yearly etc.) for each unique member (identified by Code in the example below) of a data set. The data set will contain monthly pricing information for a 20 year period for approximately 500 stocks. An example of the data is below:
Date Code Price Dividend
1 2005-01-31 xyz 1000.00 20.0
2 2005-01-31 abc 1.00 0.1
3 2005-02-28 xyz 1030.00 20.0
4 2005-02-28 abc 1.01 0.1
5 2005-03-31 xyz 1071.20 20.0
6 2005-03-31 abc 1.03 0.1
7 2005-04-30 xyz 1124.76 20.0
I am fairly new to R, but thought that there would be a more efficient solution than looping through each Code and then each Date as shown here:
uniqueDates <- unique(data$Date)
uniqueCodes <- unique(data$Code)
for (date in uniqueDates) {
  for (code in uniqueCodes) {
    nextDate <- seq.Date(from = stock_data$Date[i], by = "3 months", length.out = 2)[2]
    curPrice <- data$Price[data$Date == date]
    futPrice <- data$Price[data$Date == nextDate]
    data$ret[(data$Date == date) & (data$Code == code)] <- (futPrice / curPrice) - 1
  }
}
This method in itself has an issue in that seq.Date does not always return the final day in the month.
Unfortunately the data is not uniform (the number of companies/codes varies over time) so using a simple row offset won't work. The calculation must match the Code and Date with the desired date offset.
I had initially tried selecting the future dates by using the seq.Date function
data$ret = (data[(data$Date == (seq.Date(from = data$Date, by="3 month", length.out=2)[2])), "Price"] / data$Price) - 1
But this generated an error as seq.Date requires a single entry.
> Error in seq.Date(from = stock_data$Date, by = "3 month", length.out =
> 2) : 'from' must be of length 1
I thought that R would be well suited to this type of calculation but perhaps not. Since all the data is in a mysql database I am now thinking that it might be faster/easier to do this calc directly in the database.
Any suggestions would be greatly appreciated.
Load data:
tc='
Date Code Price Dividend
2005-01-31 xyz 1000.00 20.0
2005-01-31 abc 1.00 0.1
2005-02-28 xyz 1030.00 20.0
2005-02-28 abc 1.01 0.1
2005-03-31 xyz 1071.20 20.0
2005-03-31 abc 1.03 0.1
2005-04-30 xyz 1124.76 20.0'
df = read.table(text=tc,header=T)
df$Date=as.Date(df$Date,"%Y-%m-%d")
First I would organize the data by date:
library(plyr)
pp1=reshape(df,timevar='Code',idvar='Date',direction='wide')
Then you would like to obtain monthly, quarterly, yearly, etc. returns.
For that there are several options; one could be:
Make the data zoo or xts class, i.e.:
library(xts)
pp1[2:ncol(pp1)] = as.xts(pp1[2:ncol(pp1)], order.by = pp1$Date)

# let's create a function for calculating returns
rets <- function(x, lag = 1) {
  return(diff(log(x), lag))
}
Since this database is monthly, the lags for the returns will be: monthly = 1, quarterly = 3, yearly = 12. For instance, this calculates the monthly returns for xyz:
lagged = 1 # for monthly
pp1$returns_xyz = c(NA, rets(pp1$Price.xyz, lagged))
To get all the returns:
# create matrix of returns
pricelist = ls(pp1)[grep('Price', ls(pp1))]
returnsmatrix = data.frame(matrix(rep(0, (nrow(pp1) - 1) * length(pricelist)), ncol = length(pricelist)))
j = 1
for (i in pricelist) {
  n = which(names(pp1) == i)
  returnsmatrix[, j] = rets(pp1[, n], 1)
  j = j + 1
}
# column names
codename = gsub("Price.", "", pricelist, fixed = TRUE)
names(returnsmatrix) = paste('ret', codename, sep = '.')
returnsmatrix
You can do this very easily with the quantmod and xts packages. Using the data in AndresT's answer:
library(quantmod) # loads xts too
pp1 <- reshape(df,timevar='Code',idvar='Date',direction='wide')
# create an xts object
x <- xts(pp1[,-1], pp1[,1])
# only get the "Price.*" columns
p <- getPrice(x)
# run the periodReturn function on each column
r <- apply(p, 2, periodReturn, period="monthly", type="log")
# merge prior result into a multi-column object
r <- do.call(merge, r)
# rename columns
names(r) <- paste("monthly.return",
                  sapply(strsplit(names(p), "\\."), "[", 2), sep = ".")
Which leaves you with an r xts object containing:
monthly.return.xyz monthly.return.abc
2005-01-31 0.00000000 0.000000000
2005-02-28 0.02955880 0.009950331
2005-03-31 0.03922071 0.019608471
2005-04-30 0.04879016 NA
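For comparison, the same per-Code monthly log returns can be sketched in Python with pandas (column names follow the question's data; everything else is assumption):
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-31", "2005-01-31", "2005-02-28",
                            "2005-02-28", "2005-03-31", "2005-03-31"]),
    "Code": ["xyz", "abc", "xyz", "abc", "xyz", "abc"],
    "Price": [1000.00, 1.00, 1030.00, 1.01, 1071.20, 1.03],
})

# one column of prices per Code, indexed by month-end date
wide = data.pivot(index="Date", columns="Code", values="Price")
monthly = np.log(wide).diff()     # lag 1 on monthly data
quarterly = np.log(wide).diff(3)  # lag 3 for quarterly returns
print(monthly)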