Manipulating data in CSV - csv

Using Python 3 and numpy, I am trying to read and manipulate a CSV. My intent is to find all buildings that are over 50,000 square feet, the data for which is in column 6. The interpreter returns an error stating, "Line # (got 1 columns instead of 11)." I think that my issue is registering the data type as a string, but I have tried different data types and cannot get the script to work.
import numpy as np

dataframe = np.genfromtxt('buildingsv1.csv', dtype=str, skip_header=1, delimiter="none", usecols=(6))
headers = next(dataframe)
for row in dataframe:
    if 50000 in row(6):
        print(row)
np.savetxt('buildingsv2')
SOLUTION (using Pandas instead of Numpy)
import pandas as pd
total_df = pd.read_csv('buildingsv1.csv', keep_default_na=False, na_values=[""])
#Build new DataFrame of 4 columns
total_df[['PARCELID', 'KIVAPIN', 'ADDRESS', 'APN']]
total_df[total_df.sqft >= 50000]
A version of the raw dataset is available. I am using a desktop version with machine-readable headings and more columns.

Here's a general idea using Pandas (which is built on Numpy).
import pandas as pd
import numpy as np
# I generated df below but you'd want to read the data with pd.read_csv() like so
#df = pd.read_csv('buildingsv1.csv')
df = pd.DataFrame(np.random.rand(10, 6) * 100000,
                  columns=['Column' + str(i) for i in range(1, 7)])
new_df = df[df['Column6'] >= 50000]
It's good practice to check dtypes in Pandas using df.dtypes. Your data will need to be numeric first to filter over 50,000.
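For instance, a minimal sketch of that check and conversion, reusing the sqft column name from the question's solution; pd.to_numeric with errors='coerce' turns anything unparseable into NaN:

import pandas as pd

df = pd.read_csv('buildingsv1.csv')   # file name from the question
print(df.dtypes)                      # columns read in as strings show up as object

# coerce the square-footage column to a numeric dtype; bad values become NaN
df['sqft'] = pd.to_numeric(df['sqft'], errors='coerce')
over_50k = df[df['sqft'] >= 50000]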
If your numeric data has commas (ex: 50,000), it can be problematic. Here's an example with a column that contains commas.
>>> df1 = pd.DataFrame({'Other Data': [2, 3, 44, 5, 65, 6], 'Commas1': [' 68,028,616 ', ' 162,470,071 ', ' 135,393,045 ', ' 89,981,894 ', ' 74,787,888 ', ' 173,610,498 ']})
>>> df1
        Commas1  Other Data
0    68,028,616           2
1   162,470,071           3
2   135,393,045          44
3    89,981,894           5
4    74,787,888          65
5   173,610,498           6
>>> df1.dtypes
Commas1       object
Other Data     int64
dtype: object
One way to convert the Commas1 column is to strip the non-numeric characters with a regex (in pandas 2.0 and later you also need to pass regex=True, since str.replace now defaults to literal replacement):
df1['Commas1'] = df1['Commas1'].str.replace(r'[^\d\.]', '').astype('int64')
>>> df1
     Commas1  Other Data
0   68028616           2
1  162470071           3
2  135393045          44
3   89981894           5
4   74787888          65
5  173610498           6
>>> df1.dtypes
Commas1       int64
Other Data    int64
dtype: object
The takeaway is that Commas1 has been converted to an integer dtype in this example. You can change int64 to float64 if you need floats instead of ints.
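As an aside, if the commas are just thousands separators in the source file, pandas can strip them at read time instead; a minimal sketch, assuming the question's buildingsv1.csv has a numeric sqft column formatted that way:

import pandas as pd

# thousands=',' tells the parser to treat commas inside numbers as separators,
# so the column arrives as int64/float64 instead of object
total_df = pd.read_csv('buildingsv1.csv', thousands=',')
large_df = total_df[total_df['sqft'] >= 50000]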

Here's a sample run with a comma delimited csv (with numpy)
Simulate a file with a list of lines.
In [168]: txt="""name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""
In [169]: txt=txt.splitlines()
Load with genfromtxt:
In [170]: data = np.genfromtxt(txt,dtype=None, delimiter=',')
In [171]: data
Out[171]:
array([['name', ' val1', ' val2', ' val3'],
       ['me', ' 23', ' 34', ' 34'],
       ['you', ' 34', ' 22', ' 35'],
       ['he', ' 22', ' 66', ' 66'],
       ['she', ' 36', '32', '36']],
      dtype='|S5')
Oops, it loaded everything as strings, because the first line contains the column names.
Skip the first line:
In [174]: data = np.genfromtxt(txt,dtype=None, skip_header=1,delimiter=',')
In [175]: data
Out[175]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
       ('she', 36, 32, 36)],
      dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
It deduced the column types correctly, but gave them generic names. Use names=True to take the field names from the file's header row:
In [176]: data = np.genfromtxt(txt,dtype=None, names=True,delimiter=',')
In [177]: data
Out[177]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
       ('she', 36, 32, 36)],
      dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
data is a 1d array, with 4 records; the fields of those records are defined in the dtype.
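A couple of quick checks (written here as plain Python calls rather than the IPython session above) illustrate that structure:

# one element per data row: the array is 1-d
print(data.shape)        # (4,)
# each element is a record whose field names came from the header
print(data.dtype.names)  # ('name', 'val1', 'val2', 'val3')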
Now we can display rows from this array according to some column criteria:
In [179]: for row in data:
   .....:     if row['val2']>32:
   .....:         print(row)
   .....:
('me', 23, 34, 34)
('he', 22, 66, 66)
One record:
In [181]: data[0]
Out[181]: ('me', 23, 34, 34)
One field (column):
In [182]: data['name']
Out[182]:
array(['me', 'you', 'he', 'she'],
      dtype='|S3')
Those selected values can be collected into a new array with an expression like:
In [205]: data1=data[data['val2']>32]
In [206]: data1
Out[206]:
array([('me', 23, 34, 34), ('he', 22, 66, 66)],
      dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
Writing a matching csv isn't quite so nice with numpy. It has a savetxt that writes data in columns, but you have to specify format and header.
In [207]: header='name, val1, val2, val3'
In [208]: fmt='%10s, %4d, %4d, %4d'
In [209]: np.savetxt('test.csv',data1, fmt=fmt,header=header)
In [210]: cat test.csv
# name, val1, val2, val3
'me', 23, 34, 34
'he', 22, 66, 66
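For comparison, a minimal sketch of the same filter-and-write round trip in pandas, assuming the same four-column data; to_csv takes care of the header and formatting:

import pandas as pd
from io import StringIO

txt = """name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""

# skipinitialspace strips the blanks that follow the delimiters above
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df[df['val2'] > 32].to_csv('test.csv', index=False)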

Related

How to Search for row with Json Array Value as a condition in mysql 5.7

I have a table,
t_offices

id   name      offices
2    london    {"officeIds": [1, 33, 13, 1789]}
3    bangkok   {"officeIds": [2, 3, 40, 19]}
I can get the array in the json body using
select JSON_EXTRACT(p.`offices`, '$.officeIds[*]') from t_offices p
It leads to
[1, 33, 13, 1789]
[2, 3, 40, 19]
But now, how do I search with a condition that the array contains the value 33?
i.e.
2    london    {"officeIds": [1, 33, 13, 1789]}
Basically, get the rows where a given value is inside that JSON array.
You can try with this query:
SELECT * FROM t_offices WHERE JSON_CONTAINS(offices, '33', '$.officeIds');
OR
SELECT * FROM t_offices WHERE JSON_CONTAINS(offices->'$.officeIds', '33');

PostgreSQL average of JSON Data

I have some data like this:
id   heart_rate
1    {0: 28, 1: 25, 2: 38, 3: 42}
2    {0: 30, 1: 28, 2: 43, 3: 58}
3    {0: 42, 1: 29, 2: 98, 3: 38}
I'm trying to return an object with the averaged values, something like this:
{0: 32, 1: 26, 2: 58, 3: 43}
I tried a script to loop through and analyze the data, but with the amount of data involved, looping would take too long and not be practical.
You need to extract all values, cast them to a number, calculate the average and then convert it back to a JSON value:
select to_jsonb(r)
from (
  select avg((heart_rate ->> '0')::int) as "0",
         avg((heart_rate ->> '1')::int) as "1",
         avg((heart_rate ->> '2')::int) as "2",
         avg((heart_rate ->> '3')::int) as "3"
  from the_table
) r;
If you don't really know the keys, but you know that all of them can be cast to a number, you could do something like this:
select jsonb_object_agg(ky, average)
from (
  select r.ky, round(avg(r.val::int)) as average
  from the_table
  cross join jsonb_each(heart_rate) as r(ky, val)
  group by r.ky
) t;
Online example

Convert pandas columns to comma separated lists to be used in sql statements

I have a dataframe and I am trying to turn a column into a comma-separated list. The end goal is to pass this comma-separated list as the list of filter items in a SQL query.
How do I go about doing this?
import pandas as pd

mydata = [{'id': 'jack', 'b': 87, 'c': 1000},
          {'id': 'jill', 'b': 55, 'c': 2000},
          {'id': 'july', 'b': 5555, 'c': 22000}]
df = pd.DataFrame(mydata)
df
Expected output - note the quotes around the ids, since they are strings, and the unquoted items from column 'b', since that is a numeric field (which is what SQL expects). I would then eventually send a query like
select * from mytable where ids in (my_ids) or values in (my_values):
my_ids = 'jack', 'jill','july'
my_values = 87,55,5555
I encountered a similar issue and solved it in one line using values and tolist() as
df['col_name'].values.tolist()
So in your case, it will be
my_ids = df['id'].values.tolist()    # ['jack', 'jill', 'july']
my_values = df['b'].values.tolist()  # [87, 55, 5555]
Let's use apply with the argument reduce=False (note that this argument has been removed in newer pandas versions), then check the dtype of each series and pass the proper argument to join:
df.apply(lambda x: ', '.join(x.astype(str)) if x.dtype=='int64' else ', '.join("\'"+x.astype(str)+"\'"), reduce=False)
Output:
b 87, 55, 5555
c 1000, 2000, 22000
id 'jack', 'jill', 'july'
dtype: object
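Since the end goal is a SQL IN filter, a safer option than pasting the comma-separated text into the query string is to build a placeholder list and pass the values as parameters; a minimal sketch with the standard-library sqlite3 module and the question's hypothetical mytable:

import sqlite3

my_ids = df['id'].values.tolist()        # ['jack', 'jill', 'july']

conn = sqlite3.connect('example.db')     # placeholder database file
placeholders = ', '.join('?' * len(my_ids))
query = 'SELECT * FROM mytable WHERE id IN ({})'.format(placeholders)
rows = conn.execute(query, my_ids).fetchall()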

Merge three csv files with same headers in Python

I have multiple CSVs; however, I'm having difficulty merging them as they all have the same headers. Here's an example.
CSV 1:
ID,COUNT
1,3037
2,394
3,141
5,352
7,31
CSV 2:
ID, COUNT
1,375
2,1178
3,1238
5,2907
6,231
7,2469
CSV 3:
ID, COUNT
1,675
2,7178
3,8238
6,431
7,6469
I need to combine all the CSV files on the ID and create a new CSV with an additional column for each COUNT column.
I've been testing it with 2 CSVs but I'm still not getting the right output.
with open('csv1.csv', 'r') as checkfile:  # CSV data is pulled from
    checkfile_result = {record['ID']: record for record in csv.DictReader(checkfile)}

with open('csv2.csv', 'r') as infile:
    #infile_result = {addCount['COUNT']: addCount for addCount in csv.Dictreader(infile)}
    with open('Result.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['COUNT'])
        writer.writeheader()
        for item in reader:
            record = checkfile_result.get(item['ID'], None)
            if record:
                item['ID'] = record['COUNT'] # ???
                item['COUNT'] = record['COUNT']
            else:
                item['COUNT'] = None
                item['COUNT'] = None
            writer.writerow(item)
However, with the above code, I get three columns, but the data from the first CSV is populated in both columns. For example.
Result.csv (notice the keys skip the IDs that don't exist in the first CSV):
ID, COUNT, COUNT
1, 3037, 3037
2, 394, 394
3, 141, 141
5, 352, 352
7, 31, 31
The result should be:
ID, COUNT, COUNT
1,3037, 375
2,394, 1178
3,141, 1238
5,352, 2907
6, ,231
7,31, 2469
Etc etc
Any help will be greatly appreciated.
This works:
import csv

def read_csv(fobj):
    reader = csv.DictReader(fobj, delimiter=',')
    return {line['ID']: line['COUNT'] for line in reader}

with open('csv1.csv') as csv1, open('csv2.csv') as csv2, \
     open('csv3.csv') as csv3, open('out.csv', 'w') as out:
    data = [read_csv(fobj) for fobj in [csv1, csv2, csv3]]
    all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
    out.write('ID, COUNT, COUNT, COUNT\n')
    for key in all_keys:
        counts = (entry.get(key, '') for entry in data)
        out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
The content of the output file:
ID, COUNT, COUNT, COUNT
1, 3037, 375, 675
2, 394, 1178, 7178
3, 141, 1238, 8238
5, 352, 2907,
6, , 231, 431
7, 31, 2469, 6469
The Details
The function read_csv returns a dictionary with the IDs as keys and the counts as values. We will use this function to read all three inputs. For example, for csv1.csv:
with open('csv1.csv') as csv1:
    print(read_csv(csv1))
we get this result:
{'1': '3037', '3': '141', '2': '394', '5': '352', '7': '31'}
We need to have all keys. One way is to convert them to sets and use union to find the unique ones. We also sort them:
all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
['1', '2', '3', '5', '6', '7']
In the loop over all keys, we retrieve the count using entry.get(key, ''). If the key is not contained, we get an empty string. Look at the output file: you see just commas and no values at the places where no value was found in the input. We use a generator expression so we don't have to re-type everything three times:
counts = (entry.get(key, '') for entry in data)
This is the content of one of the generators:
list(counts)
['3037', '375', '675']
Finally, we write to our output file. The * converts a tuple like this ('3037', '375', '675') into three arguments, i.e. .format() is called like this .format(key, '3037', '375', '675'):
out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
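If pandas is an option, the same outer join on ID can be sketched in a few lines (the file names and suffix handling here are just illustrative assumptions):

import pandas as pd
from functools import reduce

frames = [pd.read_csv(name, skipinitialspace=True)
          for name in ('csv1.csv', 'csv2.csv', 'csv3.csv')]

# successive outer merges keep IDs that are missing from some of the files;
# pandas renames the duplicate COUNT columns (COUNT_x, COUNT_y, COUNT)
merged = reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'),
                frames)
merged.sort_values('ID').to_csv('out.csv', index=False)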

Blank parameter with multivalue parameter returns nothing

I have three parameters (#person_id, #Person_name, #Supervisor_name), all of which have the Allow Multiple Values and Allow Blank Value properties enabled.
The columns of the report are Person_id, Person_name, Supervisor_name, Claims_done, and average_claims_perday, created from a dataset table with the same columns.
The dataset which returns the data has this filter in its query:
where #person_id in (#person_id)
or [PersonName] in (#Person_name)
or Supervisor_name in (#supervisor_name)
The requirement is that, out of the three parameters, if any parameter is left blank, the query should give results based on the parameters that are selected with multiple values.
For Example: dataset creates the following result.
11, abc, john, 12, 3
22, def, john, 345, 9
33, ghi, bryan, 89, 7
44, jkl, bryan, 45, 6
55, mno, bryan, 60, 7
If I select the parameters #Person_name = 'mno' and #Supervisor_name = 'John' and keep #person_id blank, then it should give this result:
11, abc, john, 12, 3
22, def, john, 345, 9
55, mno, bryan, 60, 7
If I select #person_id = 11, 44 and #Supervisor_name = 'John', and leave #Person_name blank, then it should give this result:
11, abc, john, 12, 3
22, def, john, 345, 9
44, jkl, bryan, 45, 6
When I keep any of the parameters blank, the report doesn't show anything. If I select at least one value for every parameter, it gives the correct result.
Any help is appreciated.
If I understand correctly, your requirements for handling parameters can be rephrased as: If a parameter is set, then filter on it; otherwise don't filter on it.
If that is correct, change the where clause to something like this:
WHERE (Person_id in (#person_id) OR #person_id = '')
AND (PersonName in (#Person_name) OR #Person_name = '')
AND (Supervisor_name in (#supervisor_name) OR #supervisor_name = '')
This means each parameter has to be either satisfied, or has to be blank.