What is "orcl" in the following pyalgotrade code? Further, i have a csv file and how can i upload data from it into my code? - pyalgotrade

I wanted to implement my own strategy for backtesting but am unable to modify the code according to my needs:
from pyalgotrade.tools import yahoofinance
yahoofinance.download_daily_bars('orcl', 2000, 'orcl-2000.csv')

from pyalgotrade import strategy
from pyalgotrade.barfeed import yahoofeed
from pyalgotrade.technical import ma

# class to create objects
class MyStrategy(strategy.BacktestingStrategy):
    def __init__(self, feed, instrument):
        strategy.BacktestingStrategy.__init__(self, feed)
        # We want a 15 period SMA over the closing prices.
        self.__sma = ma.SMA(feed[instrument].getCloseDataSeries(), 15)
        self.__instrument = instrument

    def onBars(self, bars):
        bar = bars[self.__instrument]
        self.info("%s %s" % (bar.getClose(), self.__sma[-1]))

# Load the yahoo feed from the CSV file
feed = yahoofeed.Feed()
feed.addBarsFromCSV("orcl", "orcl-2000.csv")

# Evaluate the strategy with the feed's bars.
myStrategy = MyStrategy(feed, "orcl")
myStrategy.run()

Slightly modified from the documentation:
from pyalgotrade.tools import yahoofinance

for instrument in ["AA", "ACN"]:
    for year in [2015, 2016]:
        yahoofinance.download_daily_bars(instrument, year, r'D:\tmp\Trading\%s-%s.csv' % (instrument, year))

"orcl" is the name of the stock Oracle. If you want to use a different stock ticker, place it there.
You need to go to yahoo finance here: http://finance.yahoo.com/q/hp?s=ORCL&a=02&b=12&c=2000&d=05&e=26&f=2015&g=d
then save the file as orcl-2000.csv
This program reads in the orcl-2000.csv file from the directory and prints the prices.
If you want to download the data through python, then use a command like
instrument = "orcl"
feed = yahoofinance.build_feed([instrument], 2011, 2014, ".")
This will make files that say orcl-2011-yahoofinance.csv and so on from 2011 through 2014.
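Once those files exist, a minimal sketch (assuming the MyStrategy class from the question above) is to pass the feed returned by build_feed straight to the strategy:

from pyalgotrade.tools import yahoofinance

instrument = "orcl"
# Downloads (or reuses) orcl-2011-yahoofinance.csv ... orcl-2014-yahoofinance.csv in the current directory
feed = yahoofinance.build_feed([instrument], 2011, 2014, ".")

myStrategy = MyStrategy(feed, instrument)
myStrategy.run()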

Related

Stripe API: how to export a CSV report of Payments with Python?

I currently export data as CSV from the Payments -> Export page on the Stripe website. I use this data to create some invoices. Is there a way to do this with the Stripe API? I'm interested in the fields converted_amount, customer_email and card_name, which I can find in the exported CSV but am not able to find in the API. I tried the Charge API, but the amount is not converted to EUR. The best thing for me would be an API that behaves like the CSV export I do now, without entering the Stripe website. Is it possible?
Asking Stripe support, they confirmed there is no report to obtain the data I was looking for.
I ended up using the Charge API and the BalanceTransaction API to get the converted amount in euros. I'm sharing the code in case anyone has the same needs:
import stripe
import pandas as pd
from datetime import datetime
import numpy as np
from decimal import Decimal

stripe.api_key = "rk_test_pippo"

def stripe_get_data(resource, start_date=None, end_date=None, **kwargs):
    if start_date:
        # convert to unix timestamp
        start_date = int(start_date.timestamp())
    if end_date:
        # convert to unix timestamp
        end_date = int(end_date.timestamp())
    resource_list = getattr(stripe, resource).list(limit=5, created={"gte": start_date, "lte": end_date}, **kwargs)
    lst = []
    for i in resource_list.auto_paging_iter():
        lst.extend([i])
    df = pd.DataFrame(lst)
    if len(df) > 0:
        df['created'] = pd.to_datetime(df['created'], unit='s')
    return df

# extract Charge data
df = stripe_get_data('Charge', pd.to_datetime("2022-08-08 08:32:14"), pd.to_datetime("2022-09-20 08:32:14"))

# check if amount is EUR; if not, take the converted amount from the BalanceTransaction object
result = []
for i, row in df.iterrows():
    if df['currency'][i] == 'eur':
        result.append('{:.2f}'.format(df['amount'][i] / 100))
    else:
        eur_amount_cents = getattr(stripe.BalanceTransaction.retrieve(df['balance_transaction'][i]), 'amount')
        result.append('{:.2f}'.format(eur_amount_cents / 100))
    print('amount is ' + df['currency'][i] + ' converted amount in ' + str(result[i]) + ' eur')

df['converted_amount'] = result
# convert amount to string because importing to Google Sheets does conversion
df['converted_amount'] = df['converted_amount'].apply(lambda x: str(x).replace('.', ','))
df['customer_email'] = df['receipt_email']
df['card_name'] = df['billing_details'].apply(lambda x: x.get('name'))
df.to_csv('unified_payments.csv', encoding="utf8")
Yes, Stripe has an equivalent Report API. See their Doc and API Reference. Those report results are pretty much the same CSV you can download from the Dashboard.
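For completeness, here is a hedged sketch of that Reporting API flow with the Python library; the exact report_type string and the timestamps below are assumptions (check the report type list in Stripe's docs), not values confirmed by the answer:

import time
import stripe

stripe.api_key = "rk_test_pippo"  # placeholder key from the question

# Kick off a report run; the report_type value here is an assumed example.
run = stripe.reporting.ReportRun.create(
    report_type="balance_change_from_activity.itemized.3",
    parameters={
        "interval_start": 1659946334,  # unix timestamps for the reporting window
        "interval_end": 1663662734,
    },
)

# Report runs are asynchronous: poll until the run has finished.
while run.status == "pending":
    time.sleep(5)
    run = stripe.reporting.ReportRun.retrieve(run.id)

if run.status == "succeeded":
    # run.result is a File object; download its url using your secret key as a bearer token.
    print("Download the CSV from:", run.result.url)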

Unable to load CSV from GCS bucket to BigQuery table accurately

I am trying to load the airbnb_nyc data set from a GCS bucket to a BigQuery table. Link to the dataset.
I am using the following code:
def parse_file(element):
    for line in csv.reader([element], delimiter=','):
        return line

class DataIngestion2:
    def parse_method2(self, values):
        row1 = dict(
            zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365'),
                values))
        return row1

with beam.Pipeline(options=pipeline_options) as p:
    lines = p | 'Read' >> ReadFromText(known_args.input, skip_header_lines=1) \
              | 'parse' >> beam.Map(parse_file)
    pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
    pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(table_spec, schema=table_schema,
                                                   write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                                   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
But my output in the BigQuery table is wrong.
I am only getting values for the first two columns, and the remaining 14 columns are showing NULL. I am not able to figure out what I am doing wrong. Can someone help me find the error in my logic? I basically want to know how to transfer a CSV from a GCS bucket to BigQuery through a Dataflow pipeline.
Thank you,
You can use the ReadFromText method and then create your own transform by extending beam.DoFn. The code is attached below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about ParDo and DoFn:
https://beam.apache.org/documentation/programming-guide/#pardo
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv

COLUMN_NAMES = ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                'calculated_host_listings_count', 'availability_365']

def files(path='gs://some/path'):
    return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())

def transform_csv(element):
    rows = []
    with open(element, newline='\r\n') as f:
        itr = csv.reader(f, delimiter=',', quotechar='"')
        skip_head = next(itr)
        for row in itr:
            rows.append(row)
    return rows

def to_dict(element):
    rows = []
    for item in element:
        row_dict = {}
        zipped = zip(COLUMN_NAMES, item)
        for key, val in zipped:
            row_dict[key] = val
        rows.append(row_dict)
    yield rows

with beam.Pipeline() as p:
    read = (
        p
        | 'read-file' >> beam.Create(files())
        | 'transform-dict' >> beam.Map(transform_csv)
        | 'list-to-dict' >> beam.FlatMap(to_dict)
        | 'print' >> beam.Map(print)
        # | 'write-to-bq' >> WriteToBigQuery(schema=COLUMN_NAMES, table='ur table', project='', dataset='')
    )
EDIT 1: ReadFromText supports \r\n as the newline character, but this fails to consider the case where the column data itself contains \r\n. The code above has been updated accordingly.
EDIT 2: GcsIO error fixed.
Note: I have used GcsIO to get the list of files.
Details here
Please upvote and mark as answer if this helps.
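Since the answer links to ParDo/DoFn but the pipeline above uses Map/FlatMap, here is a minimal sketch of the same row-to-dict step written as a DoFn; CsvRowToDict is an illustrative name, and COLUMN_NAMES is the list defined above:

import apache_beam as beam

class CsvRowToDict(beam.DoFn):
    def __init__(self, column_names):
        self.column_names = column_names

    def process(self, element):
        # element is one parsed CSV row (a list of values); emit it as a dict
        yield dict(zip(self.column_names, element))

# Usage inside the pipeline above:
#   ... | 'to-dict' >> beam.ParDo(CsvRowToDict(COLUMN_NAMES)) | ...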
Let me suggest another approach for this use case. BigQuery offers a special feature for uploading from Google Cloud Storage (GCS) to BigQuery. You can load data in several formats, and CSV is among them.
There is a nice tutorial in the Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam; the process is available through the BigQuery API itself.
This works in many languages, but you do not have to use any language at all, as the process can be done from the console or via the Cloud SDK using the bq command. Everything can be found in the mentioned tutorial.
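For reference, a minimal sketch of that load-from-GCS approach using the BigQuery Python client; the bucket URI, project, dataset and table names below are placeholders, not values from the question:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.airbnb_nyc"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,       # skip the CSV header row
    autodetect=True,           # or pass an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/airbnb_nyc.csv",  # placeholder GCS URI
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")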

How to get data from the Australian Bureau of Statistics using pandasdmx

Is there anyone who has got ABS data using the pandasdmx library?
Here is the code to get data from the European Central Bank (ECB), which is working:
from pandasdmx import Request
ecb = Request('ECB')
flow_response = ecb.dataflow()
print(flow_response.write().dataflow.head())
exr_flow = ecb.dataflow('EXR')
dsd = exr_flow.dataflow.EXR.structure()
data_response = ecb.data(resource_id='EXR', key={'CURRENCY': ['USD', 'JPY']}, params={'startPeriod': '2016'})
However, when I change Request('ECB') to Request('ABS'), an error pops up on the 2nd line saying:
"{ValueError}This agency only supports requests for data, not dataflow."
Is there a way to get data from ABS?
documentation for pandasdmx: https://pandasdmx.readthedocs.io/en/stable/usage.html#basic-usage
Hope this will help
from pandasdmx import Request

Agency_Code = 'ABS'
Dataset_Id = 'ATSI_BIRTHS_SUMM'
ABS = Request(Agency_Code)
data_response = ABS.data(resource_id=Dataset_Id, params={'startPeriod': '2016'})

# This will result in a stacked DataFrame
df = data_response.write(data_response.data.series, parse_time=False)

# A flat DataFrame
data_response.write().unstack().reset_index()
The Australian Bureau of Statistics (ABS) only supports SDMX-JSON APIs; it doesn't send SDMX-ML messages like the other agencies. That's the reason it doesn't support the dataflow feature.
Please read for further reference: https://pandasdmx.readthedocs.io/en/stable/agencies.html#pre-configured-data-providers

How to use PySpark to load a rolling window from daily files?

I have a large number of fairly large daily files stored in a blob storage engine (S3, Azure Data Lake, etc.): data1900-01-01.csv, data1900-01-02.csv, ..., data2017-04-27.csv. My goal is to perform a rolling N-day linear regression, but I am having trouble with the data loading aspect. I am not sure how to do this without nested RDDs.
The schema for every .csv file is the same.
In other words, for every date d_t, I need data x_t and to join it with the data (x_t-1, x_t-2, ..., x_t-N).
How can I use PySpark to load an N-day Window of these daily files? All of the PySpark examples I can find seem to load from one very large file or data set.
Here's an example of my current code:
dates = [('1995-01-03', '1995-01-04', '1995-01-05'), ('1995-01-04', '1995-01-05', '1995-01-06')]
p = sc.parallelize(dates)

def test_run(date_range):
    dt0 = date_range[-1]  # get the latest date
    s = '/daily/data{}.csv'
    df0 = spark.read.csv(s.format(dt0), header=True, mode='DROPMALFORMED')
    file_list = [s.format(dt) for dt in date_range[:-1]]  # Get a window of trailing dates
    df1 = spark.read.csv(file_list, header=True, mode='DROPMALFORMED')
    return 1

p.filter(test_run)
p.map(test_run)  # fails with same error as p.filter
I'm on PySpark version '2.1.0'.
I'm running this on an Azure HDInsight cluster in a Jupyter notebook.
spark here is of type <class 'pyspark.sql.session.SparkSession'>.
A smaller, more reproducible example is as follows:
p = sc.parallelize([1, 2, 3])

def foo(date_range):
    df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"])
    return 1

p.filter(foo).count()
You are better off using DataFrames rather than RDDs. The DataFrame reader's csv API accepts a list of paths, like:
pathList = ['/path/to/data1900-01-01.csv', '/path/to/data1900-01-02.csv']
df = spark.read.csv(pathList)
Have a look at the documentation for read.csv.
You can form the list of paths to your daily data files by doing some date arithmetic over a window of N days, e.g. "path/to/data" + datetime.today().strftime("%Y-%m-%d") + ".csv" (this only gives you today's file name, but it's not hard to work out the date calculation for N days; see the sketch below).
However, keep in mind that the schema of all the daily CSVs must be the same for the above to work.
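A small sketch of that date arithmetic: build the list of paths for an N-day trailing window ending at a given date. The path template matches the question; window_paths, the window size and the example end date are illustrative, not part of the original answer.

from datetime import date, timedelta

def window_paths(end_date, n, template='/daily/data{}.csv'):
    # oldest first, most recent last, so paths[-1] is the latest file
    days = [end_date - timedelta(days=i) for i in range(n - 1, -1, -1)]
    return [template.format(d.strftime('%Y-%m-%d')) for d in days]

paths = window_paths(date(1995, 1, 5), 3)
# ['/daily/data1995-01-03.csv', '/daily/data1995-01-04.csv', '/daily/data1995-01-05.csv']
df = spark.read.csv(paths, header=True, mode='DROPMALFORMED')  # assumes an existing SparkSession named spark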
Edit: when you parallelize the list of dates, i.e. p, each date gets processed individually by a different executor, so the input to test_run2 wasn't really a list of dates; it was one individual string like 1995-01-01.
Try this instead and see if it works:
# Get the list of dates
date_range = window(dates, N)
s = '/daily/data{}.csv'

dt0 = date_range[-1]  # most recent file
df0 = spark.read.csv(s.format(dt0), header=True, mode='DROPMALFORMED')

# read previous files
file_list = [s.format(dt) for dt in date_range[:-1]]
df1 = spark.read.csv(file_list, header=True, mode='DROPMALFORMED')

r, resid = computeLinearRegression(df0, df1)
r.write.save('/daily/r{}.csv'.format(dt0))
resid.write.save('/daily/resid{}.csv'.format(dt0))

Using Python's csv.DictReader to search for a specific key and then print its value

BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file contains the fieldnames; however, my CSV dictionary file simply starts with "key","value" data and goes on for at least 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for and present the value (which is the 2nd column) to the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (input), my output will be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv

def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
        print list_of_stuff

main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is an open file object (or other iterable of lines), not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
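For example, continuing from the data dict built above (the title here is just one from the sample data):

title = "Station Palais"               # e.g. the title the user typed in
print(data.get(title, "Not found"))    # -> 972811:0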
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data: the dict() constructor in Python can take any iterable of key/value pairs.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))

print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io
# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass

def main():
    # Case-insensitive & leading/trailing whitespace insensitive
    user_city = input('Enter a city: ').strip().lower()
    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            print("City not found.")

if __name__ == '__main__':
    main()
The advantage of this technique is that the CSV isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower() on both the user's input and the city keys to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, it returns almost immediately and stops looking through the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
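As a hedged sketch of that suggestion, the standard library's sqlite3 module is enough for a simple keyed lookup; the database file and table names here are illustrative:

import csv
import io
import sqlite3

conn = sqlite3.connect('anchors.db')
conn.execute('CREATE TABLE IF NOT EXISTS anchors (title TEXT PRIMARY KEY, value TEXT)')

# Load the CSV once; later runs can query the database directly.
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    conn.executemany('INSERT OR REPLACE INTO anchors VALUES (?, ?)', csv.reader(f))
conn.commit()

row = conn.execute('SELECT value FROM anchors WHERE title = ?', ('Station Palais',)).fetchone()
print(row[0] if row else 'City not found.')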