Python3: Adding equal elements together from json format

Data = [{'Ferrari': 51078},
        {'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
        {'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
        {'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
        {'Ferrari': 2313172, 'Renault': 2475},
        {'Volvo': 198671},
        {'Volvo': 15762}]
I want to add together the numbers for each car so that I get the total for each one (the numbers below are only an example and do not match Data):
Ferrari: 152455
Volvo: 13515
Skoda: 1532
Lambo: 4366
Renault: 4262
Maserati: 2345
Lexus: 235
Jeep: 124
Saab: 15
I've tried sum(), appending to new lists, collections, and many other potential solutions, but I just cannot get this one right. I'm looking for a general solution, not one tied to this dataset: if the cars and numbers in Data change, it should still work.
I'm using Python3.

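You can use defaultdict. The code below iterates over the list of dicts, pops key-value pairs from each dict until it is empty, and adds each value to the running total for its key. Note that popitem() removes an arbitrary pair (the most recently inserted one on Python 3.7+) and empties the dicts in data as a side effect.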
from collections import defaultdict
data = [{'Ferrari': 51078},
        {'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
        {'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
        {'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
        {'Ferrari': 2313172, 'Renault': 2475},
        {'Volvo': 198671},
        {'Volvo': 15762}]
output = defaultdict(int)
for d in data:
    while d:
        k, v = d.popitem()
        output[k] += v
print(output)
Outputs
defaultdict(<class 'int'>, {'Ferrari': 2741097,
'Lambo': 862,
'Skoda': 29428,
'Volvo': 297937,
'Lexus': 161,
'Saab': 2708,
'Jeep': 4025,
'Renault': 5669,
'Maserati': 11030,
'Fiat': 27583})
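If the input dicts should stay intact, a non-destructive variant may be preferable. The following is a minimal sketch using collections.Counter from the standard library instead of popitem(); the variable name totals is just an illustration. Counter.update() adds the values of matching keys rather than overwriting them.

```python
from collections import Counter

data = [{'Ferrari': 51078},
        {'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
        {'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
        {'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
        {'Ferrari': 2313172, 'Renault': 2475},
        {'Volvo': 198671},
        {'Volvo': 15762}]

totals = Counter()
for d in data:
    # update() with a mapping adds its values to the existing counts;
    # the dicts in data are only read, never modified.
    totals.update(d)

for car, amount in totals.items():
    print(f"{car}: {amount}")
```

Because nothing here depends on the specific keys, the same loop should work unchanged if the cars or numbers in data change.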

Related

csv empty strings handling and values appending

I have a csv of ~50 rows (stars) and ~30 columns (name, magnitudes and distance) that contains some empty string values (''). I am trying to do two things, and none of the help I have found so far has worked: (1) parse empty strings as 0.0, so that I can (2) append each row to a list of lists (which I called s).
In other words:
- s is a list of stars (each one has all its parameters)
- d is a particular parameter for all the stars (distance), which I obtain correctly.
The big issue is with s. My attempt:
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars) #skip header
    stars = list(csv_stars)
s = [] # star
d = [] # distances
for row in stars:
    row[row==''] = '0'
    s.append(float(row)) #stars
    d.append(arcsec*AU*float(row[30]))
I can't think of a better syntax, and so I get the error
s.append(float(row)) # stars
TypeError: float() argument must be a string or a number
From s I would obtain later the magnitudes for all the stars, separately. But first things first...
@cwasdwa Please look at the code below; it should give you an idea. There may well be a better way. This solution is based on what I understood from your code.
import csv
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars) #skip header
    stars = list(csv_stars)
s = [] # star
d = [] # distances
for row in stars:
    newRow = [] #create new row list to convert all '' to 0.0
    for x in row:
        if x =='':
            newRow.append(0.0)
        else:
            newRow.append(x) #non-empty cells are kept as strings
    s.append(newRow) #stars
    if row[30] == '':
        value = 0.0
    else:
        value = row[30]
    d.append(arcsec*AU*float(value)) #arcsec and AU come from the question's own code
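The loop above leaves non-empty cells as strings. If numeric cells should become floats as well, one option is a small per-cell helper. This is a sketch under assumptions: to_float is a hypothetical helper name, the three-column sample data is made up to stand in for stars.csv, and cells that are not numbers (such as the star name) are deliberately left unchanged.

```python
import csv
import io

def to_float(cell):
    """Return 0.0 for an empty cell, a float for a numeric cell,
    and the original string for anything else (e.g. a star name)."""
    if cell == "":
        return 0.0
    try:
        return float(cell)
    except ValueError:
        return cell

# Made-up sample standing in for stars.csv (name, magnitude, distance).
sample = io.StringIO("name,mag,dist\nStarA,1.5,2.5\nStarB,,7.0\n")

reader = csv.reader(sample)
next(reader)  # skip header
s = [[to_float(cell) for cell in row] for row in reader]  # one list per star
d = [row[2] for row in s]  # distance column (index 30 in the question's file)

print(s)
print(d)
```

With the real file, io.StringIO(...) would be replaced by open('stars.csv', newline=''), and the distance index by the one used in the question.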

How to find intersection or subset of two CSV files

I have 2 CSV files containing two columns and a large number of rows. The first column is the id, and the second is the set of paired values.
e.g.:
CSV1:
1 {[1,2],[1,4],[5,6],[3,1]}
2 {[2,4] ,[6,3], [8,3]}
3 {[3,2], [5,2], [3,5]}
CSV2:
1 {[2,4] ,[6,3], [8,3]}
2 {[3,4] ,[3,3], [2,3]}
3 {[1,4],[5,6],[3,1],[5,5]}
Now I need to produce a CSV file that contains the items that match exactly, or the subsets that appear in both CSVs.
Here the result should be:
{[2,4] ,[6,3], [8,3]}
{[1,4],[5,6],[3,1]}
Can anyone suggest python code to do this?
As suggested by this answer, you can use set.intersection to get the intersection of two sets; however, this does not work when the items are lists, because lists are not hashable. Instead, you can use filter (comparable to this answer):
>>> l1 = [[1,2],[1,4],[5,6],[3,1]]
>>> l2 = [[1,4],[5,6],[3,1],[5,5]]
>>> filter(lambda q: q in l2, l1)
[[1, 4], [5, 6], [3, 1]]
In Python 3, filter returns a lazy iterator rather than a list, so wrap it in list():
>>> list(filter(lambda x: x in l2,l1))
You can load CSV files (if they are really comma [or some other character] separated files) with csv.reader or pandas.read_csv for example.
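The filter approach scans l2 once for every item in l1. A possible alternative, sketched here with the same l1 and l2, is to convert each pair to a tuple so it can be stored in a set; the names set2 and common are illustrative only.

```python
l1 = [[1, 2], [1, 4], [5, 6], [3, 1]]
l2 = [[1, 4], [5, 6], [3, 1], [5, 5]]

# Tuples are hashable, so the pairs from l2 can go into a set
# for fast membership tests.
set2 = {tuple(pair) for pair in l2}

# Keep the pairs of l1 that also occur in l2, preserving l1's order.
common = [pair for pair in l1 if tuple(pair) in set2]

print(common)  # [[1, 4], [5, 6], [3, 1]]
```

For the two CSV files, the same comparison would be repeated for each combination of rows once the paired values have been parsed into lists.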

How to decode a csv file with long lines in tensorflow with tf.decode_csv?

How can I decode a csv file with long lines (e.g., with so many items per line that listing them one by one for the output is not realistic) using tf.TextLineReader() and tf.decode_csv?
The typical usage is:
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [1,1,1,1,1]
a,b,c,d,e = tf.decode_csv(records=value,record_defaults=record_defaults, field_delim=" ")
When we have thousands of items in a line, it's impossible to assign them one by one as (a,b,c,d,e) above. Can all the items be decoded into a list or something similar?
Let's say you have 1800 columns of data. You can use this as the record default:
record_defaults=[[1]]*1800
and then use
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
to read them.
Well, tf.decode_csv returns a list, so you can simply do:
record_defaults = [[1], [1], [1], [1], [1]]
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
all_columns
Out: [<tf.Tensor 'DecodeCSV:0' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:1' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:2' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:3' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:4' shape=() dtype=int32>
]
You can then evaluate it as usual:
sess = tf.Session()
sess.run(all_columns)
Out: [1, 1, 1, 1, 1]
Note that you need to pass a rank-1 record_defaults, that is, each default wrapped in its own list as above. If you have problems with a hanging queue, ...
Here is how I mix different dtypes in record_defaults:
record_defaults = [tf.constant(.1, dtype=tf.float32) for count in range(100)] # 100 fp32 features
record_defaults.extend([tf.constant(1, dtype=tf.int32) for count in range(2)]) # 2 int32 features

Calculating the average of a column in csv per hour

I have a csv file that contains data in the following format.
Layer relative_time Ht BSs Vge Temp Message
57986 2:52:46 0.00m 87 15.4 None CMSG
20729 0:23:02 45.06m 82 11.6 None BMSG
20729 0:44:17 45.06m 81 11.6 None AMSG
I want to read in this csv file and calculate the average BSs for every hour. My csv file is fairly large, about 2000 values. However, the values are not evenly distributed across the hours. For example,
I have 237 samples from hour 3 and only 4 samples from hour 6. I should also mention that the BSs can be collected from multiple sources. The value always ranges from 20 to 100. Because of the uneven distribution, the result is skewed. For each hour I calculate the sum of BSs for that hour divided by the number of samples in that hour.
The primary purpose is to understand how BSs evolves over time.
What is the common approach to this problem? Is this where people apply normalization? It would be great if someone could explain how to apply normalization in such a situation.
The code I am using for my processing is shown below. I believe it is correct.
#This 24x2 matrix will contain no of values recorded per hour
hours_no_values = [[0 for i in range(24)] for j in range(2)]
#This 24x2 matrix will contain mean bss stats per hour
mean_bss_stats = [[0 for i in range(24)] for j in range(2)]
with open(PREFINAL_OUTPUT_FILE) as fin, open(FINAL_OUTPUT_FILE, "w",newline='') as f:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(f)
    header = next(reader) # <--- Pop header out
    writer.writerow([header[0],header[1],header[2],header[3],header[4],header[5],header[6]]) # <--- Write header
    sortedlist = sorted(reader, key=lambda row: datetime.datetime.strptime(row[1],"%H:%M:%S"), reverse=True)
    print(sortedlist)
    for item in sortedlist:
        rel_time = datetime.datetime.strptime(item[1], "%H:%M:%S")
        if rel_time.hour not in hours_no_values[0]:
            print('item[6] {}'.format(item[6]))
            if 'MAN' in item[6]:
                print('Hour found {}'.format(rel_time.hour))
                hours_no_values[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] +=1
            else:
                pass
        else:
            if 'MAN' in item[6]:
                print('Hour Previous {}'.format(rel_time.hour))
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] +=1
            else:
                pass
for i in range(0,24):
    if(hours_no_values[1][i] != 0):
        mean_bss_stats[1][i] = mean_bss_stats[1][i]/hours_no_values[1][i]
    else:
        mean_bss_stats[1][i] = 0
pprint.pprint('mean bss stats {} \n hour_no_values {} \n'.format(mean_bss_stats,hours_no_values))
The number of values for each hour, from hour 0 to hour 23, is as follows:
[31, 117, 85, 237, 3, 67, 11, 4, 57, 0, 5, 21, 2, 5, 10, 8, 29, 7, 14, 3, 1, 1, 0, 0]
You could do it with pandas, using groupby and then aggregating the appropriate column:
import pandas as pd
import numpy as np
df = pd.read_csv("your_file")
df.groupby('hour')['BSs'].aggregate(np.mean)
If you don't have that column in the initial dataframe, you can add it:
df['hour'] = your_hour_data
numpy.mean: calculates the mean of the array. Compute the arithmetic mean along the specified axis.
pandas.groupby: Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
From pandas docs:
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts
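The split, apply and combine steps can also be sketched with the standard library only, which additionally shows one way to derive the hour from relative_time. This is an illustration rather than the pandas approach itself: the rows list holds just the three sample lines from the question (relative_time and BSs), and the names sums, counts and means are made up for the sketch.

```python
from collections import defaultdict
from datetime import datetime

# (relative_time, BSs) pairs taken from the sample rows in the question.
rows = [("2:52:46", 87), ("0:23:02", 82), ("0:44:17", 81)]

sums = defaultdict(int)    # hour -> sum of BSs
counts = defaultdict(int)  # hour -> number of samples

for rel_time, bss in rows:
    # Split: the hour of relative_time is the group key.
    hour = datetime.strptime(rel_time, "%H:%M:%S").hour
    # Apply: accumulate what is needed for the mean.
    sums[hour] += bss
    counts[hour] += 1

# Combine: one mean per hour that has at least one sample.
means = {hour: sums[hour] / counts[hour] for hour in sums}

for hour in sorted(means):
    print(hour, means[hour], "from", counts[hour], "samples")
```

Keeping the per-hour counts next to the means makes it visible which averages rest on only a handful of samples.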

How do I feed in my own data into PyAlgoTrade?

I'm trying to use PyAlgoTrade's event profiler.
However, I don't want to use data from yahoo!finance. I want to use my own data but can't figure out how to feed in the CSV, which is in this format:
Timestamp Low Open Close High BTC_vol USD_vol [8] [9]
2013-11-23 00 800 860 847.666666 886.876543 853.833333 6195.334452 5248330 0
2013-11-24 00 745 847.5 815.01 860 831.255 10785.94131 8680720 0
The complete CSV is here
I want to do something like:
def main(plot):
    instruments = ["AA", "AES", "AIG"]
    feed = yahoofinance.build_feed(instruments, 2008, 2009, ".")
Then replace yahoofinance.build_feed(instruments, 2008, 2009, ".") with my CSV
I tried:
import csv
with open( 'FinexBTCDaily.csv', 'rb' ) as csvfile:
    data = csv.reader( csvfile )
def main( plot ):
    feed = data
But it throws an attribute error. Any ideas how to do this?
I suggest creating your own RowParser and Feed, which is much easier than it sounds; have a look here: yahoofeed
This also allows you to work with intraday data and cleanup the data if needed, like your timestamp.
Another possibility, of course, would be to parse your file and save it, so it looks like a yahoo feed. In your case, you would have to adapt the columns and the Timestamp.
Step A: follow the PyAlgoTrade docs on the GenericBarFeed class
See addBarsFromCSV() in the CSV section of the BarFeed class docs for v0.16 and for v0.17.
Note
- The CSV file must have the column names in the first row.
- It is ok if the Adj Close column is empty.
- When working with multiple instruments:
--- If all the instruments loaded are in the same timezone, then the timezone parameter may not be specified.
--- If any of the instruments loaded are in different timezones, then the timezone parameter should be set.
addBarsFromCSV( instrument, path, timezone = None )
Loads bars for a given instrument from a CSV formatted file. The instrument gets registered in the bar feed.
Parameters:
(string) instrument – Instrument identifier.
(string) path – The path to the CSV file.
(pytz) timezone – The timezone to use to localize bars. Check pyalgotrade.marketsession.
Next:
A BarFeed loads bars from CSV files that have the following format:
Date Time, Open, High, Low, Close, Volume, Adj Close
2013-01-01 13:59:00,13.51001,13.56,13.51,13.56789,273.88014126,13.51001
Step B: implement a documented CSV-file pre-formatting
Your CSV data will need some clean-up before it can be used with PyAlgoTrade methods. This is doable: you can write a simple transformer by hand, or use the lambda-based converters of numpy.genfromtxt().
The sample code below is for illustration only, to show what converters can do for your own transformations, since your CSV structure differs.
with open( getCsvFileNAME( ... ), "r" ) as aFH:
    numpy.genfromtxt( aFH,
                      skip_header = 1, # Ref. pyalgotrade
                      delimiter = ",",
                      # v v v v v v
                      # 2011.08.30,12:00,1791.20,1792.60,1787.60,1789.60,835
                      # 2011.08.30,13:00,1789.70,1794.30,1788.70,1792.60,550
                      # 2011.08.30,14:00,1792.70,1816.70,1790.20,1812.10,1222
                      # 2011.08.30,15:00,1812.20,1831.50,1811.90,1824.70,2373
                      # 2011.08.30,16:00,1824.80,1828.10,1813.70,1817.90,2215
                      converters = { 0: lambda aString: mPlotDATEs.date2num( datetime.datetime.strptime( aString, "%Y.%m.%d" ) ), #_______________________________________asFloat ( 1.0, +++ )
                                     1: lambda aString: ( ( int( aString[0:2] ) * 60 + int( aString[3:] ) ) / 60. / 24. ) # ( 15*60 + 00 ) / 60. / 24.__asFloat < 0.0, 1.0 )
                                     # HH: :MM HH MM
                                     }
                      )
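The "by hand" route can be sketched with the standard csv module alone. Everything about the input here is an assumption made for illustration: a comma-separated file with a header row naming Timestamp, Low, Open, Close, High and BTC_vol, timestamps such as "2013-11-23 00", and BTC_vol used as the volume. reformat_rows is a hypothetical helper, not part of PyAlgoTrade. The output follows the Date Time, Open, High, Low, Close, Volume, Adj Close layout quoted above, with Adj Close left empty.

```python
import csv
import io
from datetime import datetime

OUT_HEADER = ["Date Time", "Open", "High", "Low", "Close", "Volume", "Adj Close"]

def reformat_rows(src, dst, ts_format="%Y-%m-%d %H"):
    """Copy rows from src to dst, reordering columns and
    rewriting the timestamp as YYYY-MM-DD HH:MM:SS."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(OUT_HEADER)
    for row in reader:
        stamp = datetime.strptime(row["Timestamp"], ts_format)
        writer.writerow([
            stamp.strftime("%Y-%m-%d %H:%M:%S"),
            row["Open"], row["High"], row["Low"], row["Close"],
            row["BTC_vol"],  # assumed to be the volume column
            "",              # Adj Close left empty
        ])

# Made-up one-row sample in the assumed input layout.
sample = io.StringIO(
    "Timestamp,Low,Open,Close,High,BTC_vol\n"
    "2013-11-23 00,800,860,847.67,886.88,6195.33\n"
)
out = io.StringIO()
reformat_rows(sample, out)
converted = out.getvalue()
print(converted)
```

With real files, the two StringIO objects would be replaced by files opened with newline='', and the column names adjusted to whatever the actual header contains.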
You can use pyalgotrade.barfeed.addBarsFromSequence with list comprehension to feed in data from CSV row by row/bar by bar. Basically you create a bar from each row, pass OHLCV as init parameters and extra columns with additional data in a dictionary. You can try something like this (with all the required imports):
data = pd.DataFrame(index=pd.date_range(start='2021-11-01', end='2021-11-05'),
                    columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
                             'ExtraCol1', 'ExtraCol3', 'ExtraCol4', 'ExtraCol5'],
                    data=np.random.rand(5, 10))
feed = yahoofeed.Feed()
feed.addBarsFromSequence('instrumentID', data.index.map(lambda i:
    BasicBar(
        i,
        data.loc[i, 'Open'],
        data.loc[i, 'High'],
        data.loc[i, 'Low'],
        data.loc[i, 'Close'],
        data.loc[i, 'Volume'],
        data.loc[i, 'Adj Close'],
        Frequency.DAY,
        data.loc[i, 'ExtraCol1':].to_dict())
    ).values)
The input data frame was created with random values to make this example easier to reproduce, but the part where the bars are added to the feed should work the same for data frames loaded from CSVs, provided that valid column names are used.