I opened a CSV file from argv and read it into a list of dictionaries:
import csv
from sys import argv

data = open(argv[1])
reader = csv.DictReader(data)
dict_list = []
for line in reader:
    dict_list.append(line)
and now, when I try to access the content of the csv file like this:
for x in dict_list:
    print(x)
all I get is this:
OrderedDict([('name', 'Alice'), ('AGATC', '2'), ('AATG', '8'), ('TATC', '3')])
With this loop:
for x in dict_list[0]:
    print(x)
I get this result:
name
AGATC
AATG
TATC
Can you help me access 'Alice', '2', '8' and '3'?
You can iterate through the dictionary a couple of ways. First, let's initialize the dictionary with your values:
from collections import OrderedDict
dict_list = OrderedDict([('name', 'Alice'), ('AGATC', '2'), ('AATG', '8'), ('TATC', '3')])
which gets us:
OrderedDict([('name', 'Alice'), ('AGATC', '2'), ('AATG', '8'), ('TATC', '3')])
you can then iterate through each key and then query the value attached:
for k in dict_list:
    print(f"key={k}, value={dict_list[k]}")
and you will get:
key=name, value=Alice
key=AGATC, value=2
key=AATG, value=8
key=TATC, value=3
or, you can get both the key and the value at the same time:
for (k, v) in dict_list.items():
    print(f"key={k}, value={v}")
which will get you the same output:
key=name, value=Alice
key=AGATC, value=2
key=AATG, value=8
key=TATC, value=3
I converted each OrderedDict in dict_list to a regular dictionary, and now I can access the values of the keys:
for x in dict_list:
    temp = dict(x)
    for y in types_count:
        print(temp.get(y))
How to standardize the output of USQL to have data for all the columns when converted from JSON
We have a requirement to standardize the output of USQL. The USQL reads the JSON (source file) data and converts it to CSV format. The problem is that the number of columns in each CSV row is not the same, because of missing data in the JSON: one row may have N columns while another has N+1. We would like to standardize the output so that every row has the same number of columns. How do we achieve this? We have no control over the source file data, so we need to do the standardization while processing. Has anyone faced a similar challenge and found a solution? Thanks for your help!
Input details :
{"map": {"key1": 100, "key2": 101, "key3": 102}, "id": 2, "time": 1540300241230}
{"map": {"key1": 200, "key2": 201, "key3": 202, "key4": 203}, "id": 2, "time": 1540320246930}
{"map": {"key1": 300, "key3": 301, "key4": 303}, "id": 2, "time": 1540350246930}
Once the above JSON is converted to CSV based on some calculation, the output as-is (which is not correct):
key1, key2, key3, key4
100, 101, 102
200, 201, 202, 203
300, 301, 303
In the incorrect output, the value "301" in the third row lands under the key2 column even though it is not associated with key2.
Output expected - # is the default for missing column values
key1, key2, key3, key4
100, 101, 102, #
200, 201, 202, 203
300, #, 301, 303
Later, all the headings (key1, key2, ...) will be replaced with actual header names (Pressure, Velocity, etc.).
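Outside of USQL, the fill-the-missing-columns idea itself can be sketched in Python with pandas; this is an illustration only, using the example records above and the '#' placeholder from the expected output:

```python
import json
import pandas as pd

# The three example records from the input details above.
lines = [
    '{"map": {"key1": 100, "key2": 101, "key3": 102}, "id": 2, "time": 1540300241230}',
    '{"map": {"key1": 200, "key2": 201, "key3": 202, "key4": 203}, "id": 2, "time": 1540320246930}',
    '{"map": {"key1": 300, "key3": 301, "key4": 303}, "id": 2, "time": 1540350246930}',
]

# One row per record, aligned on the union of keys; missing keys become NaN.
# dtype=object keeps the present values as plain ints.
df = pd.DataFrame([json.loads(line)["map"] for line in lines],
                  columns=["key1", "key2", "key3", "key4"], dtype=object)

# Replace the gaps with the '#' default, so every row has all four columns.
print(df.fillna("#").to_csv(index=False))
```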
USE DATABASE [ADLSDB];
DECLARE EXTERNAL #INPUT_FILE string = "/adlspath/keyValue.txt";
DECLARE EXTERNAL #PIVOT_FILE string = "/adlspath/pivot.txt";
/* The meta data about the request starts - contents of the file request.json */
#requestData = EXTRACT id int, timestamp string, key string, value int FROM #INPUT_FILE USING Extractors.Csv();
#data = SELECT id AS id, timestamp AS timestamp, key AS key, value AS value FROM #requestData;
DECLARE EXTERNAL #ids string = "key1,key2,key3,key4"; //"external declaration"
#result = SELECT * FROM (SELECT id, timestamp, key, value FROM #data )
AS D PIVOT(SUM(value) FOR key IN(#ids AS heading)) AS P;
OUTPUT #result TO #PIVOT_FILE USING Outputters.Csv(quoting:false, outputHeader:false);
I was able to get close to the solution using the above code, but I am stuck at passing multiple values to the IN clause. I will have the list of #ids at compile time of the USQL, but passing it as a comma-separated scalar variable does not produce the result. If I pass only one value (say, key1), the IN condition matches and the rows for key1 are output. Does anyone know how to pass multiple values to the IN clause of the USQL PIVOT function?
------Updated------------
We were able to solve the problem by using dynamic USQL: one USQL script writes the USQL statements to its output in the required format, and another Data Factory activity then reads and runs the dynamically generated USQL.
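The script-generation step can be sketched as follows; this is a hypothetical generator, reusing the PIVOT statement from the code above, with the key list spliced in as literal text rather than passed as a scalar variable:

```python
# Hypothetical generator for the dynamic USQL described above.
# The key list would come from an earlier discovery pass over the data.
keys = ["key1", "key2", "key3", "key4"]

# PIVOT's IN clause needs literal values, so build it as text.
in_list = ", ".join(f'"{k}" AS {k}' for k in keys)

script = (
    "#result = SELECT * FROM (SELECT id, timestamp, key, value FROM #data) AS D "
    f"PIVOT(SUM(value) FOR key IN({in_list})) AS P;"
)
print(script)
```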
Sorry for bad English ))
I have an array of ids in my ruby code.
Example:
[[10], [], [10, 1, 3], []]
Can I load the User model from the MySQL users table in one query while preserving the grouping?
Example:
[[#<User id=10>], [], [#<User id=10>, #<User id=1>, #<User id=3>], []]
Environment: Ruby 2.5.1 | Rails 5 | MySQL
One solution I found:
I can flatten my array of ids, load my model by that array, and index the result into a hash:
hash = User.where(id: array.flatten).index_by(&:id)
Then, while iterating through the array, I can fetch my objects from the hash in the right order, like this:
array.each do |ids|
  users = ids.map { |id| hash[id] }
  # do smth
end
This is simple: use the flatten method on the array:
ids = [[123], [], [123, 1, 3, 4], [70, 80]]
user_ids = ids.flatten.reject(&:blank?).uniq
users = User.where(id: user_ids)
edited:
a non-optimal (recursive) method for your need:
def map_users_by_id(ids_array:, users:)
  result = []
  ids_array.each do |array_element|
    if array_element.is_a? Array
      result << map_users_by_id(ids_array: array_element, users: users)
    else
      result << users[array_element]
    end
  end
  result
end
ids = [[123], [], [123, 1, 3, 4], [70, 80]]
user_ids = ids.flatten.reject(&:blank?).uniq
users = Hash[User.where(id: user_ids).map{|user|[user.id, user]}]
result = map_users_by_id(ids_array: ids, users: users)
I have a dataframe and I am trying to turn one of its columns into a comma-separated list. The end goal is to pass this comma-separated list as the list of filter items in a SQL query.
How do I go about doing this?
import pandas as pd

mydata = [{'id': 'jack', 'b': 87, 'c': 1000},
          {'id': 'jill', 'b': 55, 'c': 2000},
          {'id': 'july', 'b': 5555, 'c': 22000}]
df = pd.DataFrame(mydata)
df
Expected solution (note the quotes around the ids, since they are strings, but none around the items in column 'b', since that is a numerical field and that is the way SQL works). I would then eventually send a query like
select * from mytable where ids in (my_ids) or values in (my_values):
my_ids = 'jack', 'jill', 'july'
my_values = 87, 55, 5555
I encountered a similar issue and solved it in one line using values and tolist():
df['col_name'].values.tolist()
So in your case (the DataFrame is named df), it will be
my_ids = df['id'].values.tolist()  # ['jack', 'jill', 'july']
my_values = df['b'].values.tolist()  # [87, 55, 5555]
Let's use apply, check the dtype of each column, and pass the proper argument to join, quoting only the string column:
df.apply(lambda x: ', '.join(x.astype(str)) if x.dtype == 'int64' else ', '.join("'" + x.astype(str) + "'"))
Output:
b 87, 55, 5555
c 1000, 2000, 22000
id 'jack', 'jill', 'july'
dtype: object
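To get from those lists to the final query text, one possible sketch (the table and column names come from the question; in real code, prefer parameterized queries over building SQL strings by hand):

```python
import pandas as pd

mydata = [{'id': 'jack', 'b': 87, 'c': 1000},
          {'id': 'jill', 'b': 55, 'c': 2000},
          {'id': 'july', 'b': 5555, 'c': 22000}]
df = pd.DataFrame(mydata)

# Quote the string ids; leave the numeric values bare.
my_ids = ", ".join(f"'{i}'" for i in df['id'])
my_values = ", ".join(str(v) for v in df['b'])

query = f"select * from mytable where ids in ({my_ids}) or values in ({my_values})"
print(query)
```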
Using Python 3 and numpy, I am trying to read and manipulate a CSV. My intent is to find all buildings that are over 50,000 square feet, the data for which is in column 6. The interpreter returns an error stating, "Line # (got 1 columns instead of 11)." I think that my issue is registering the data type as a string, but I have tried different data types and cannot get the script to work.
import numpy as np

dataframe = np.genfromtxt('buildingsv1.csv', dtype=str, skip_header=1, delimiter="none", usecols=(6))
headers = next(dataframe)
for row in dataframe:
    if 50000 in row(6):
        print(row)
np.savetxt('buildingsv2')
SOLUTION (using Pandas instead of Numpy)
import pandas as pd

total_df = pd.read_csv('buildingsv1.csv', keep_default_na=False, na_values=[""])
# Build a new DataFrame of 4 columns
subset_df = total_df[['PARCELID', 'KIVAPIN', 'ADDRESS', 'APN']]
total_df[total_df.sqft >= 50000]
A version of the raw dataset is available. I am using a desktop version with machine-readable headings and more columns.
Here's a general idea using Pandas (which is built on Numpy).
import pandas as pd
import numpy as np
# I generated df below, but you'd want to read the data with pd.read_csv() like so:
# df = pd.read_csv('buildingsv1.csv')
df = pd.DataFrame(np.random.rand(10, 6)*100000,
                  columns=['Column'+str(i) for i in range(1, 7)])
new_df = df[df['Column6'] >= 50000]
It's good practice to check dtypes in Pandas using df.dtypes. Your data will need to be numeric first to filter over 50,000.
If your numeric data has commas (ex: 50,000), it can be problematic. Here's an example with a column that contains commas.
>>> df1 = pd.DataFrame({'Other Data': [2, 3, 44, 5, 65, 6], 'Commas1': [' 68,028,616 ', ' 162,470,071 ', ' 135,393,045 ', ' 89,981,894 ', ' 74,787,888 ', ' 173,610,498 ']})
>>> df1
Commas1 Other Data
0 68,028,616 2
1 162,470,071 3
2 135,393,045 44
3 89,981,894 5
4 74,787,888 65
5 173,610,498 6
>>> df1.dtypes
Commas1 object
Other Data int64
dtype: object
One way to convert Commas1 column is to use regex:
df1['Commas1'] = df1['Commas1'].str.replace(r'[^\d\.]', '', regex=True).astype('int64')
>>> df1
Commas1 Other Data
0 68028616 2
1 162470071 3
2 135393045 44
3 89981894 5
4 74787888 65
5 173610498 6
>>> df1.dtypes
Commas1 int64
Other Data int64
dtype: object
The takeaway is, Commas1 has been converted to an integer datatype in this example. You can change int64 to float64 for example if you need floats instead of ints.
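If the separators are plain commas, an alternative to the regex chain is pd.to_numeric after stripping them; a small sketch using two of the values from the example above:

```python
import pandas as pd

# Values padded with spaces and thousands separators, as in the example above.
s = pd.Series([' 68,028,616 ', ' 162,470,071 '])

# Drop the commas literally (no regex), trim whitespace,
# and let pandas infer a numeric dtype.
cleaned = pd.to_numeric(s.str.replace(',', '', regex=False).str.strip())
print(cleaned.dtype)
```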
Here's a sample run with a comma delimited csv (with numpy)
Simulate a file with a list of lines.
In [168]: txt="""name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""
In [169]: txt=txt.splitlines()
Load with genfromtxt:
In [170]: data = np.genfromtxt(txt,dtype=None, delimiter=',')
In [171]: data
Out[171]:
array([['name', ' val1', ' val2', ' val3'],
['me', ' 23', ' 34', ' 34'],
['you', ' 34', ' 22', ' 35'],
['he', ' 22', ' 66', ' 66'],
['she', ' 36', '32', '36']],
dtype='|S5')
oops, it loaded strings - because the first line is names.
Skip the first line:
In [174]: data = np.genfromtxt(txt,dtype=None, skip_header=1,delimiter=',')
In [175]: data
Out[175]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
It deduced the column types correctly, but gave them generic names. Use names=True to take the column headers from the file:
In [176]: data = np.genfromtxt(txt,dtype=None, names=True,delimiter=',')
In [177]: data
Out[177]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
data is a 1d array, with 4 records; the fields of those records are defined in the dtype.
Now we can display rows from this array according to some column criteria:
In [179]: for row in data:
   .....:     if row['val2']>32:
   .....:         print(row)
   .....:
('me', 23, 34, 34)
('he', 22, 66, 66)
One record:
In [181]: data[0]
Out[181]: ('me', 23, 34, 34)
One field (column):
In [182]: data['name']
Out[182]:
array(['me', 'you', 'he', 'she'],
dtype='|S3')
Those selected values can be collected into a new array with an expression like:
In [205]: data1=data[data['val2']>32]
In [206]: data1
Out[206]:
array([('me', 23, 34, 34), ('he', 22, 66, 66)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
Writing a matching csv isn't quite so nice with numpy. It has a savetxt that writes data in columns, but you have to specify format and header.
In [207]: header='name, val1, val2, val3'
In [208]: fmt='%10s, %4d, %4d, %4d'
In [209]: np.savetxt('test.csv',data1, fmt=fmt,header=header)
In [210]: cat test.csv
# name, val1, val2, val3
'me', 23, 34, 34
'he', 22, 66, 66