Merge three CSV files with the same headers in Python

I have multiple CSVs, and I'm having difficulty merging them because they all have the same headers. Here's an example.
CSV 1:
ID,COUNT
1,3037
2,394
3,141
5,352
7,31
CSV 2:
ID, COUNT
1,375
2,1178
3,1238
5,2907
6,231
7,2469
CSV 3:
ID, COUNT
1,675
2,7178
3,8238
6,431
7,6469
I need to combine all the CSV files on ID and create a new CSV with an additional column for each COUNT column.
I've been testing it with two CSVs, but I'm still not getting the right output.
with open('csv1.csv', 'r') as checkfile: #CSV Data is pulled from
    checkfile_result = {record['ID']: record for record in csv.DictReader(checkfile)}

with open('csv2.csv', 'r') as infile:
    #infile_result = {addCount['COUNT']: addCount for addCount in csv.Dictreader(infile)}
    with open('Result.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, reader.fieldnames + ['COUNT'])
        writer.writeheader()
        for item in reader:
            record = checkfile_result.get(item['ID'], None)
            if record:
                item['ID'] = record['COUNT'] # ???
                item['COUNT'] = record['COUNT']
            else:
                item['COUNT'] = None
                item['COUNT'] = None
            writer.writerow(item)
However, with the above code I get three columns, but the data from the first CSV is populated in both COUNT columns. For example:
Result.csv (notice that the rows skip the IDs that don't exist in the first CSV):
ID, COUNT, COUNT
1, 3037, 3037
2, 394, 394
3, 141, 141
5, 352, 352
7, 31, 31
The result should be:
ID, COUNT, COUNT
1,3037, 375
2,394, 1178
3,141, 1238
5,352, 2907
6, ,231
7,31, 2469
Etc etc
Any help will be greatly appreciated.

This works:
import csv

def read_csv(fobj):
    reader = csv.DictReader(fobj, delimiter=',')
    return {line['ID']: line['COUNT'] for line in reader}

with open('csv1.csv') as csv1, open('csv2.csv') as csv2, \
     open('csv3.csv') as csv3, open('out.csv', 'w') as out:
    data = [read_csv(fobj) for fobj in [csv1, csv2, csv3]]
    all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
    out.write('ID, COUNT, COUNT, COUNT\n')
    for key in all_keys:
        counts = (entry.get(key, '') for entry in data)
        out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
The content of the output file:
ID, COUNT, COUNT, COUNT
1, 3037, 375, 675
2, 394, 1178, 7178
3, 141, 1238, 8238
5, 352, 2907,
6, , 231, 431
7, 31, 2469, 6469
The Details
The function read_csv returns a dictionary with the IDs as keys and the counts as values. We will use this function to read all three inputs. For example, for csv1.csv:
with open('csv1.csv') as csv1:
    print(read_csv(csv1))
we get this result:
{'1': '3037', '3': '141', '2': '394', '5': '352', '7': '31'}
We need all the keys from all three files. One way is to convert the dictionaries to sets (of their keys) and use union to find the unique IDs. We also sort them:
all_keys = sorted(set(data[0]).union(data[1]).union(data[2]))
['1', '2', '3', '5', '6', '7']
In the loop over all keys, we retrieve the count using entry.get(key, ''). If the key is not contained, we get an empty string. Look at the output file: you see just commas and no values at the places where no value was found in the input. We use a generator expression so we don't have to re-type everything three times:
counts = (entry.get(key, '') for entry in data)
This is the content of one of the generators:
list(counts)
['3037', '375', '675']
Finally, we write to our output file. The * converts a tuple like this ('3037', '375', '675') into three arguments, i.e. .format() is called like this .format(key, '3037', '375', '675'):
out.write('{}, {}, {}, {}\n'.format(key, *tuple(counts)))
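Not part of the answer above, but for comparison, the same outer merge on ID can be sketched with pandas, assuming pandas is installed and the three sample files exist as shown:
import pandas as pd

# skipinitialspace handles the " COUNT" headers that have a leading space.
frames = [pd.read_csv(name, skipinitialspace=True)
          for name in ('csv1.csv', 'csv2.csv', 'csv3.csv')]

merged = frames[0]
for i, frame in enumerate(frames[1:], start=2):
    # An outer join keeps IDs that are missing from either side (blank cells in the output).
    merged = merged.merge(frame, on='ID', how='outer', suffixes=('', '_{}'.format(i)))

merged.sort_values('ID').to_csv('out_pandas.csv', index=False)
This produces the same ID/COUNT layout, with empty cells wherever an ID is missing from one of the files.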


How to access values in an OrderedDict?

I opened and read a CSV file from argv into a list of dictionaries:
import csv
from sys import argv

data = open(argv[1])
reader = csv.DictReader(data)
dict_list = []
for line in reader:
    dict_list.append(line)
Now, when I want to access the content of the CSV file like this:
print(dict_list[0])
All I get is this:
"OrderedDict([('name', 'Alice'), ('AGATC', '2'), ('AATG', '8'), ('TATC', '3')])"
With this loop:
for x in dict_list[0]:
    print(x)
I get this result:
name
AGATC
AATG
TATC
Can you help me access 'Alice', '2', '8' and '3'?
You can iterate through the dictionary in a couple of ways.
Let's initialize the dictionary with your values:
from collections import OrderedDict
dict_list = OrderedDict([('name', 'Alice'), ('AGATC', '2'), ('AATG', '8'), ('TATC', '3')])
which gets us:
OrderedDict([('name', 'Alice'), ('AGATC', '2'), ('AATG', '8'), ('TATC', '3')])
You can then iterate through each key and look up the value attached to it:
for k in dict_list:
    print(f"key={k}, value={dict_list[k]}")
and you will get:
key=name, value=Alice
key=AGATC, value=2
key=AATG, value=8
key=TATC, value=3
Or you can get both the key and the value at the same time:
for (k, v) in dict_list.items():
    print(f"key={k}, value={v}")
which will get you the same output:
key=name, value=Alice
key=AGATC, value=2
key=AATG, value=8
key=TATC, value=3
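If you only need individual values rather than a loop, you can also index the dictionary directly or pull all values at once (continuing with the same dict_list as initialized above):
print(dict_list['name'])         # Alice
print(dict_list['AGATC'])        # 2
print(list(dict_list.values()))  # ['Alice', '2', '8', '3']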
I converted each OrderedDict in dict_list to a plain dictionary, and now I can access the values of the keys:
for x in dict_list:
    temp = dict(x)
    for y in types_count:  # types_count is a list of column names defined elsewhere in my program
        print(temp.get(y))

How to standardize the output of USQL to have data for all the columns when converted from JSON

We have a requirement to standardize the output of USQL. The USQL script reads the JSON (source file) data and converts it to CSV format. The problem is that the number of columns in each CSV row is not the same because of missing data in the JSON: sometimes the USQL result set has a row with N columns and another row with N+1 columns (cells). We would like to standardize the output so that every row has the same number of columns. How do we achieve this? We don't have any control over the source file data, so we need to do the standardization while processing. Has anyone faced similar challenges and found a solution? Thanks for your help!
Input details :
{"map": {"key1": 100, "key2": 101, "key3": 102}, "id": 2, "time": 1540300241230}
{"map": {"key1": 200, "key2": 201, "key3": 202 "key4": 203}, "id": 2, "time": 1540320246930}
{"map": {"key1": 300, "key3": 301, "key4": 303}, "id": 2, "time": 1540350246930}
Once the above JSON is converted to CSV based on some calculation, the output as-is (which is not correct) looks like this:
key1, key2, key3, key4
100, 101, 102
200, 201, 202, 203
300, 301, 303
Value "301" is not associated with the key2
Output expected - # is the default for missing column values
key1, key2, key3, key4
100, 101, 102, #
200, 201, 202, 203
300, #, 301, 303
Later, all the headings (key1, key2, ...) will be replaced with the actual header names (Pressure, Velocity, etc.). This is the USQL I have so far:
USE DATABASE [ADLSDB];

DECLARE EXTERNAL @INPUT_FILE string = "/adlspath/keyValue.txt";
DECLARE EXTERNAL @PIVOT_FILE string = "/adlspath/pivot.txt";

/* The meta data about the request starts - contents of the file request.json */
@requestData = EXTRACT id int, timestamp string, key string, value int FROM @INPUT_FILE USING Extractors.Csv();
@data = SELECT id AS id, timestamp AS timestamp, key AS key, value AS value FROM @requestData;

DECLARE EXTERNAL @ids string = "key1,key2,key3,key4"; //"external declaration"

@result = SELECT * FROM (SELECT id, timestamp, key, value FROM @data) AS D
          PIVOT(SUM(value) FOR key IN(@ids AS heading)) AS P;

OUTPUT @result TO @PIVOT_FILE USING Outputters.Csv(quoting:false, outputHeader:false);
I was able to get close to the solution by using the above code; however, I am stuck at passing multiple values to the IN clause. I will have the list of @ids at compile time of the USQL, but passing it as a comma-separated scalar variable does not produce the result. If I pass only one value (say key1), then the IN condition matches and outputs the rows for key1. Does anyone know how to pass multiple values to the IN clause of the USQL PIVOT function?
------Updated------------
We were able to solve the problem by using dynamic USQL: one USQL script writes the USQL statements to the output in the required format, and another Data Factory activity then runs the dynamically generated USQL.
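Outside of USQL, the standardization step itself (a fixed set of columns with # as the default for missing values) can be sketched in Python for reference; this is only an illustration and assumes the JSON lines shown above sit in a local file named keyValue.txt:
import csv
import json

fieldnames = ['key1', 'key2', 'key3', 'key4']

with open('keyValue.txt') as src, open('standardized.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(fieldnames)
    for line in src:
        record = json.loads(line)
        # '#' is written wherever a key is missing from the "map" object.
        writer.writerow([record['map'].get(k, '#') for k in fieldnames])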

Get grouped values by multiple ids in ActiveRecord

Sorry for my bad English.
I have an array of ids in my Ruby code.
Example:
[[10], [], [10, 1, 3], []]
Can I load the User model from the MySQL users table in one query while preserving the grouping?
Example:
[[#<User id=10>], [], [#<User id=10>, #<User id=1>, #<User id=3>], []]
Environment: Ruby 2.5.1 | Rails 5 | MySQL
One solution I found:
I can flatten my array of ids and load the models from that array into a hash:
hash = User.where(id: array.flatten).index_by(&:id)
Then, while iterating through the array, I can fetch my objects from the hash in the right order like this:
array.each do |ids|
  users = ids.map { |id| hash[id] }
  # do smth
end
This is simple: use the flatten method on the array:
ids = [[123], [], [123, 1, 3, 4], [70, 80]]
user_ids = ids.flatten.reject(&:blank?).uniq
users = User.where(id: user_ids)
Edited: a non-optimal (recursive) method for your need:
def map_users_by_id(ids_array:, users:)
  result = []
  ids_array.each do |array_element|
    if array_element.is_a?(Array)
      result << map_users_by_id(ids_array: array_element, users: users)
    else
      result << users[array_element]
    end
  end
  return result
end
ids = [[123], [], [123, 1, 3, 4], [70, 80]]
user_ids = ids.flatten.reject(&:blank?).uniq
users = Hash[User.where(id: user_ids).map{|user|[user.id, user]}]
result = map_users_by_id(ids_array: ids, users: users)

Convert pandas columns to comma separated lists to be used in sql statements

I have a dataframe and I am trying to turn a column into a comma-separated list. The end goal is to pass this comma-separated list as the list of filter values in a SQL query.
How do I go about doing this?
import pandas as pd

mydata = [{'id' : 'jack', 'b': 87, 'c': 1000},
          {'id' : 'jill', 'b': 55, 'c': 2000},
          {'id' : 'july', 'b': 5555, 'c': 22000}]
df = pd.DataFrame(mydata)
df
Expected result below. Note the quotes around the ids, since they are strings, while the items from column 'b' stay unquoted because that is a numerical field; this matches how SQL treats literals. I would then eventually send a query like
select * from mytable where ids in (my_ids) or values in (my_values):
my_ids = 'jack', 'jill','july'
my_values = 87,55,5555
I encountered a similar issue and solved it in one line using values and tolist() as
df['col_name'].values.tolist()
So in your case, it will be:
my_ids = df['id'].values.tolist()     # ['jack', 'jill', 'july']
my_values = df['b'].values.tolist()   # [87, 55, 5555]
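To get from those Python lists to the exact comma-separated strings shown in the question, one more formatting step is needed. A minimal sketch (plain string formatting is shown only for illustration; parameterized queries are safer in practice):
ids_sql = ', '.join("'{}'".format(v) for v in my_ids)   # "'jack', 'jill', 'july'"
values_sql = ', '.join(str(v) for v in my_values)       # "87, 55, 5555"
query = "select * from mytable where ids in ({}) or values in ({})".format(ids_sql, values_sql)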
Let's use apply with the argument reduce=False (this parameter was removed from DataFrame.apply in later pandas versions in favor of result_type), then check the dtype of each series and apply the proper quoting in the join:
df.apply(lambda x: ', '.join(x.astype(str)) if x.dtype=='int64' else ', '.join("\'"+x.astype(str)+"\'"), reduce=False)
Output:
b 87, 55, 5555
c 1000, 2000, 22000
id 'jack', 'jill', 'july'
dtype: object

Manipulating data in CSV

Using Python 3 and numpy, I am trying to read and manipulate a CSV. My intent is to find all buildings that are over 50,000 square feet; the data for that is in column 6. The interpreter returns an error stating, "Line # (got 1 columns instead of 11)". I think my issue is registering the data type as a string, but I have tried different data types and cannot get the script to work.
import numpy as np

dataframe = np.genfromtxt('buildingsv1.csv', dtype=str, skip_header=1, delimiter="none", usecols=(6))
headers = next(dataframe)
for row in dataframe:
    if 50000 in row(6):
        print(row)
np.savetxt('buildingsv2')
SOLUTION (using Pandas instead of Numpy)
import pandas as pd
total_df = pd.read_csv('buildingsv1.csv', keep_default_na=False, na_values=[""])
#Build new DataFrame of 4 columns
total_df[['PARCELID', 'KIVAPIN', 'ADDRESS', 'APN']]
total_df[total_df.sqft >= 50000]
A version of the raw dataset is available. I am using a desktop version with machine-readable headings and more columns.
Here's a general idea using Pandas (which is built on Numpy).
import pandas as pd
import numpy as np

# I generated df below, but you'd want to read the data with pd.read_csv() like so:
# df = pd.read_csv('buildingsv1.csv')
df = pd.DataFrame(np.random.rand(10, 6)*100000,
                  columns=['Column'+str(i) for i in range(1, 7)])

new_df = df[df['Column6'] >= 50000]
It's good practice to check dtypes in Pandas using df.dtypes. Your data will need to be numeric first to filter over 50,000.
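For example, if the square-footage column comes in as strings, a conversion along these lines would be needed before the comparison (the column name sqft is taken from the asker's own solution above; adjust it to the real header):
# Coerce non-numeric entries to NaN instead of raising, then filter.
total_df['sqft'] = pd.to_numeric(total_df['sqft'], errors='coerce')
big_buildings = total_df[total_df['sqft'] >= 50000]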
If your numeric data has commas (ex: 50,000), it can be problematic. Here's an example with a column that contains commas.
>>> df1 = pd.DataFrame({'Other Data': [2, 3, 44, 5, 65, 6], 'Commas1': [' 68,028,616 ', ' 162,470,071 ', ' 135,393,045 ', ' 89,981,894 ', ' 74,787,888 ', ' 173,610,498 ']})
>>> df1
Commas1 Other Data
0 68,028,616 2
1 162,470,071 3
2 135,393,045 44
3 89,981,894 5
4 74,787,888 65
5 173,610,498 6
>>> df1.dtypes
Commas1 object
Other Data int64
dtype: object
One way to convert the Commas1 column is to use a regex:
df1['Commas1'] = df1['Commas1'].str.replace(r'[^\d\.]', '', regex=True).astype('int64')
>>> df1
Commas1 Other Data
0 68028616 2
1 162470071 3
2 135393045 44
3 89981894 5
4 74787888 65
5 173610498 6
>>> df1.dtypes
Commas1 int64
Other Data int64
dtype: object
The takeaway is that Commas1 has been converted to an integer datatype in this example. You can change int64 to float64, for example, if you need floats instead of ints.
Here's a sample run with a comma-delimited CSV (using numpy).
Simulate a file with a list of lines.
In [168]: txt="""name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""
In [169]: txt=txt.splitlines()
Load with genfromtxt:
In [170]: data = np.genfromtxt(txt,dtype=None, delimiter=',')
In [171]: data
Out[171]:
array([['name', ' val1', ' val2', ' val3'],
['me', ' 23', ' 34', ' 34'],
['you', ' 34', ' 22', ' 35'],
['he', ' 22', ' 66', ' 66'],
['she', ' 36', '32', '36']],
dtype='|S5')
Oops, it loaded strings, because the first line contains the names.
Skip the first line:
In [174]: data = np.genfromtxt(txt,dtype=None, skip_header=1,delimiter=',')
In [175]: data
Out[175]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
It deduced the column types correctly, but gave them generic names. Use names=True to take the field names from the column headers in the file:
In [176]: data = np.genfromtxt(txt,dtype=None, names=True,delimiter=',')
In [177]: data
Out[177]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
data is a 1d array, with 4 records; the fields of those records are defined in the dtype.
Now we can display rows from this array according to some column criteria:
In [179]: for row in data:
   .....:     if row['val2']>32:
   .....:         print(row)
   .....:
('me', 23, 34, 34)
('he', 22, 66, 66)
One record:
In [181]: data[0]
Out[181]: ('me', 23, 34, 34)
One field (column):
In [182]: data['name']
Out[182]:
array(['me', 'you', 'he', 'she'],
dtype='|S3')
Those selected values can be collected into a new array with an expression like:
In [205]: data1=data[data['val2']>32]
In [206]: data1
Out[206]:
array([('me', 23, 34, 34), ('he', 22, 66, 66)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
Writing a matching CSV isn't quite so nice with numpy. It has savetxt, which writes the data in columns, but you have to specify the format and the header.
In [207]: header='name, val1, val2, val3'
In [208]: fmt='%10s, %4d, %4d, %4d'
In [209]: np.savetxt('test.csv',data1, fmt=fmt,header=header)
In [210]: cat test.csv
# name, val1, val2, val3
'me', 23, 34, 34
'he', 22, 66, 66