How can I import CSV file data into Aerospike without using the Aerospike loader? Is there any alternative to the Aerospike loader?

I want to import the following CSV file data into Aerospike and then run a simple select query from a Python client to display the data, e.g.:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
query = client.query( 'test', 'csvfile' )
query.select( 'policyID', 'statecode' )

You could use Python's csv module along with the Aerospike Python client:
https://docs.python.org/2/library/csv.html
http://www.aerospike.com/docs/client/python/
And do something similar to the following:
import aerospike
import csv

# Connect to the cluster first (adjust the host/port for your setup)
config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

rec = {}
with open('aerospike.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    rownum = 0
    for row in reader:
        if rownum == 0:
            # Save the first row as the header
            header = row
        else:
            for colnum, col in enumerate(row):
                rec[header[colnum]] = col
        rownum += 1
        # print(rownum, rec)
        if rec:
            client.put(('test', 'demo', str(rownum)), rec)
            rec = {}
Note: you may need to check the size of your headers and make sure they do not exceed 14 characters; otherwise you will get an error like:
error: (21L, 'A bin name should not exceed 14 characters limit', 'src/main/conversions.c', 500)
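A quick pre-flight check can catch this before any put. Below is a sketch (the 14-character limit is taken from the error message above; the helper name is mine) that shortens over-long headers while keeping them unique:

```python
# Sketch: validate/shorten CSV headers against the 14-character
# bin-name limit quoted in the error message above.
BIN_NAME_LIMIT = 14

def safe_bin_names(header):
    """Truncate any header longer than the limit, keeping names unique."""
    seen = set()
    out = []
    for name in header:
        short = name[:BIN_NAME_LIMIT]
        # Disambiguate if truncation causes a collision
        n = 1
        while short in seen:
            suffix = str(n)
            short = name[:BIN_NAME_LIMIT - len(suffix)] + suffix
            n += 1
        seen.add(short)
        out.append(short)
    return out
```

For example, `point_granularity` from the sample header above would be shortened to `point_granular`.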

As far as I am aware, there is no pre-built tool other than the loader that allows you to import CSV. You could, perhaps, build one using the existing client tools.
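As a side note, csv.DictReader builds the header-to-value mapping for you, so the reader loop above can be sketched more compactly (the `client.put` call is left as a comment, since it needs a live cluster):

```python
import csv
import io

def csv_records(fileobj):
    """Yield each CSV data row as a dict keyed by the header row."""
    for row in csv.DictReader(fileobj):
        yield dict(row)

# Example with a slice of the data from the question:
sample = io.StringIO(
    "policyID,statecode,county\n"
    "119736,FL,CLAY COUNTY\n"
    "448094,FL,CLAY COUNTY\n"
)
for i, rec in enumerate(csv_records(sample), start=1):
    # Each rec is ready to be written as a set of bins, e.g.:
    # client.put(('test', 'demo', str(i)), rec)
    print(rec['policyID'], rec['statecode'])
```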


How can I save some json files generated in a for loop as csv?

Sorry, I am new to coding in Python. I need to save a JSON file generated in a for loop as a CSV, one file per iteration of the loop.
I wrote code that generates the first CSV file fine, but it then gets overwritten on every subsequent iteration, and I have not found a solution yet. Can anyone help me? Many thanks.
from twarc.client2 import Twarc2
import itertools
import pandas as pd
import json
import numpy as np

# Your bearer token here
t = Twarc2(bearer_token="AAAAAAAAAAAAAAAAAAAAA....WTW")

# Get a bunch of user handles you want to check:
list_of_names = np.loadtxt("usernames.txt", dtype="str")

# Get the `data` part of every request only, as one list
def get_data(results):
    return list(itertools.chain(*[result['data'] for result in results]))

user_objects = get_data(t.user_lookup(users=list_of_names, usernames=True))

for user in user_objects:
    following = get_data(t.following(user['id']))
    # Do something with the lists
    print(f"User: {user['username']} Follows {len(following)} -2")
    json_string = json.dumps(following)
    df = pd.read_json(json_string)
    df.to_csv('output_file.csv')
You need to add a sequence number or some other unique identifier to the filename. The clearest options are to keep track of a counter or to use a GUID. Below I've used a counter that is initialized before your loop and incremented in each iteration. This produces a list of files like output_file_0.csv, output_file_1.csv, output_file_2.csv, and so on.
counter = 0
for user in user_objects:
    following = get_data(t.following(user['id']))
    # Do something with the lists
    print(f"User: {user['username']} Follows {len(following)} -2")
    json_string = json.dumps(following)
    df = pd.read_json(json_string)
    df.to_csv('output_file_' + str(counter) + '.csv')
    counter += 1
We convert the integer to a string and insert it between the name of your file and its extension. Alternatively, you can let enumerate manage the counter for you:
from twarc.client2 import Twarc2
import itertools
import pandas as pd
import json
import numpy as np

# Your bearer token here
t = Twarc2(bearer_token="AAAAAAAAAAAAAAAAAAAAA....WTW")

# Get a bunch of user handles you want to check:
list_of_names = np.loadtxt("usernames.txt", dtype="str")

# Get the `data` part of every request only, as one list
def get_data(results):
    return list(itertools.chain(*[result['data'] for result in results]))

user_objects = get_data(t.user_lookup(users=list_of_names, usernames=True))

for idx, user in enumerate(user_objects):
    following = get_data(t.following(user['id']))
    # Do something with the lists
    print(f"User: {user['username']} Follows {len(following)} -2")
    json_string = json.dumps(following)
    df = pd.read_json(json_string)
    df.to_csv(f'output_file{idx}.csv')
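The GUID option mentioned above can be sketched with Python's uuid module (the helper name here is mine, not from the original answer):

```python
import uuid

def unique_csv_name(stem='output_file'):
    """Build a collision-free CSV filename using a random UUID."""
    return f"{stem}_{uuid.uuid4().hex}.csv"
```

Unlike a counter, this stays unique even across separate runs of the script, at the cost of less readable filenames.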

Single CSV output with data in different columns

I have a number of CSV files with data in the first three columns only. I want to copy data from each CSV file and paste it into one single CSV file in column order. For example data from the first CSV file goes into columns 1,2 and 3 in the output file. Similarly, data from the 2nd CSV goes to columns 4,5, and 6 of the same output CSV file and so on. Any help would be highly appreciated. Thanks.
I have tried the following code, but it writes all the data into the same columns:
import glob
import pandas as pd
import time

start = time.time()
Filename = 'Combined_Data.csv'
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
for i in range(len(all_filenames)):
    data = pd.read_csv(all_filenames[i], skiprows=23)
    data = data.rename({'G1': 'CH1', 'G2': 'CH2', 'Dis': 'CH3'}, axis=1)
    data = data[['CH1', 'CH2', 'CH3']]
    data = data.apply(pd.to_numeric, errors='coerce')
    print(all_filenames[i])
    if i == 0:
        data.to_csv(Filename, sep=',', index=False, header=True, mode='a')
    else:
        data.to_csv(Filename, sep=',', index=False, header=False, mode='a')
end = time.time()
print((end - start), 'Seconds(Execution Time)')
If you don't need to write your own code for this, I'd recommend GoCSV's zip command; it can also handle the CSVs having different numbers of rows.
I have three CSV files:
file1.csv
Dig1,Dig2,Dig3
1,2,3
4,5,6
7,8,9
file2.csv
Letter1,Letter2,Letter3
a,b,c
d,e,f
and file3.csv
RomNum1,RomNum2,RomNum3
I,II,III
When I run gocsv zip file2.csv file1.csv file3.csv I get:
Letter1,Letter2,Letter3,Dig1,Dig2,Dig3,RomNum1,RomNum2,RomNum3
a,b,c,1,2,3,I,II,III
d,e,f,4,5,6,,,
,,,7,8,9,,,
GoCSV is pre-built for a number of different operating systems.
Here's how to do it with Python's CSV module, using these files:
file1.csv
Dig1,Dig2,Dig3
1,2,3
4,5,6
7,8,9
file2.csv
Letter1,Letter2,Letter3
a,b,c
d,e,f
and file3.csv
RomNum1,RomNum2,RomNum3
I,II,III
The more-memory-intensive option
This accumulates the final CSV one file at a time, expanding a list that represents the final CSV with each new input CSV.
#!/usr/bin/env python3
import csv
import sys

csv_files = [
    'file2.csv',
    'file1.csv',
    'file3.csv',
]

all_rows = []

for csv_file in csv_files:
    with open(csv_file, newline='') as f:
        reader = csv.reader(f)
        rows = list(reader)

    len_all = len(all_rows)

    # First file: initialize all_rows and continue (skip the merge step)
    if len_all == 0:
        all_rows = rows
        continue

    # The number of columns accumulated so far
    len_cols = len(all_rows[0])

    # Extend the accumulated rows with the new rows
    for i, row in enumerate(rows):
        # Make sure there are at least as many accumulated rows as this file has
        if i >= len_all:
            all_rows.append([''] * len_cols)
        all_rows[i].extend(row)

# Finally, pad all rows on the right
len_cols = len(all_rows[0])
for i in range(len(all_rows)):
    len_row = len(all_rows[i])
    if len_row < len_cols:
        all_rows[i].extend([''] * (len_cols - len_row))

writer = csv.writer(sys.stdout)
writer.writerows(all_rows)
The streaming option
This reads-and-writes a line/row at a time.
(This is basically a Python port of the Go code behind GoCSV's zip, mentioned above.)
import csv
import sys

fnames = [
    'file2.csv',
    'file1.csv',
    'file3.csv',
]
num_files = len(fnames)

readers = [csv.reader(open(x, newline='')) for x in fnames]

# Collect "header" lines; each header defines the number
# of columns for its file
headers = []
num_cols = 0
offsets = [0]
for reader in readers:
    header = next(reader)
    headers.append(header)
    num_cols += len(header)
    offsets.append(num_cols)

writer = csv.writer(sys.stdout)

# With all headers counted, every row must have this many columns
shell_row = [''] * num_cols
for i, header in enumerate(headers):
    start = offsets[i]
    end = offsets[i + 1]
    shell_row[start:end] = header

# Write headers
writer.writerow(shell_row)

# Expect that not all CSVs have the same number of rows;
# some will "finish" ahead of others
file_is_complete = [False] * num_files
num_complete = 0

# Loop a row at a time...
while True:
    # ... for each CSV
    for i, reader in enumerate(readers):
        if file_is_complete[i]:
            continue
        start = offsets[i]
        end = offsets[i + 1]
        try:
            row = next(reader)
            # Put this row in its place in the main row
            shell_row[start:end] = row
        except StopIteration:
            file_is_complete[i] = True
            num_complete += 1
    if num_complete == num_files:
        break
    # Done iterating the CSVs (for this row), write it
    writer.writerow(shell_row)
    # Reset for the next main row
    shell_row = [''] * num_cols
For either, I get:
Letter1,Letter2,Letter3,Dig1,Dig2,Dig3,RomNum1,RomNum2,RomNum3
a,b,c,1,2,3,I,II,III
d,e,f,4,5,6,,,
,,,7,8,9,,,
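As a compact alternative to the two scripts above (a sketch of mine, not from the original answer), itertools.zip_longest can do the row-pairing and padding in a few lines, at the cost of reading every file into memory:

```python
import csv
import sys
from itertools import zip_longest

def zip_csvs(paths, out):
    """Paste CSV files side by side, padding shorter files with blanks."""
    tables = []
    widths = []
    for path in paths:
        with open(path, newline='') as f:
            rows = list(csv.reader(f))
        tables.append(rows)
        widths.append(len(rows[0]))  # the header defines each file's width
    writer = csv.writer(out)
    # zip_longest yields None once a shorter file runs out of rows
    for parts in zip_longest(*tables):
        row = []
        for part, width in zip(parts, widths):
            row.extend(part if part is not None else [''] * width)
        writer.writerow(row)

# Usage: zip_csvs(['file2.csv', 'file1.csv', 'file3.csv'], sys.stdout)
```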

How can I quickly extract information from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to compute numbers based on the data in them (for example, counting speakers by gender). I am currently using the S3 paginator and json.loads to read each file and extract information, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
here is some of my code:
import json
import boto3

client = boto3.client('s3')
s3 = boto3.resource('s3')

paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = json_content['dict-name']
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second is just the read or includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found that network-optimized instances (like c5n; look for the n) and inf1 instances (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multiprocessing is orders of magnitude faster than single-threaded reads.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd

from utils_file_handling import *
ufh = file_utilities()  # instantiate the class functions - see below (second file)

bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix, pass '' (empty string, or the function will fail)

# Define the multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# Create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000]  # as an example
num_proc = os.cpu_count()  # tells you how many processors your machine has; the function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect a dataframe for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
# utils_file_handling.py
# Create this in a separate file; name it as you wish, but change the import in the script.py file
import boto3
import json
import pandas as pd

# Define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')

class file_utilities:
    """File handling functions."""

    def get_keys_from_prefix(self, bucket, prefix):
        '''Gets the list of keys for the given bucket and prefix.'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # Use Delimiter to limit the search to that level of the hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """Reads one JSON file from S3 and returns its text."""
        obj = s3sc.get_object(Bucket=bucket, Key=key)
        return obj['Body'].read().decode('utf-8')

    # You may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''Returns a dataframe built from one JSON file.'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
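If pandas is more than you need, the same multiprocessing pattern works for plain counting. Below is a self-contained sketch of mine (local files stand in for the S3 downloads, and the 'speakers'/'gender' field names are assumptions about your JSON layout) that tallies a field across many JSON files in parallel:

```python
import json
import os
from collections import Counter
from multiprocessing import Pool

def count_genders(path):
    """Count the (assumed) 'gender' field of each speaker in one JSON file."""
    with open(path) as f:
        data = json.load(f)
    return Counter(sp.get('gender', 'unknown') for sp in data['speakers'])

def count_all(paths, num_proc=None):
    """Map count_genders over the files in parallel and merge the tallies."""
    with Pool(num_proc or os.cpu_count()) as pool:
        totals = Counter()
        for c in pool.imap_unordered(count_genders, paths, chunksize=16):
            totals += c
    return totals
```

In a real run you would replace the local open() with the S3 read from the class above; the per-file worker plus a merged Counter is the part that parallelizes.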

Is there an easy way to split a large CSV file with multiline entries?

Hi, I have this huge 14 GB CSV file with entries that span multiple lines and would like an easy way to split it. BTW, the split command will not work, because it is not aware of how many columns there are in a row and will cut it in the wrong place.
Using XSV (https://github.com/BurntSushi/xsv) is very simple:
xsv split -s 10000 ./outputdir inputFile.csv
-s 10000 sets the number of records to write into each chunk.
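If you'd rather stay in Python, the csv module is quote-aware, so a small splitter (a sketch of mine along the same lines as xsv split) handles entries that span multiple lines correctly:

```python
import csv
import os

def split_csv(src, outdir, rows_per_chunk):
    """Split src into chunks of rows_per_chunk records, repeating the
    header in each chunk. csv.reader parses quoted multiline fields as
    single records, so rows are never cut mid-entry."""
    os.makedirs(outdir, exist_ok=True)
    with open(src, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        out = None
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                out = open(os.path.join(outdir, f'chunk_{i // rows_per_chunk}.csv'),
                           'w', newline='')
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
        if out:
            out.close()

# Usage: split_csv('inputFile.csv', './outputdir', 10000)
```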
import os
import pandas as pd

data_root = r"/home/glauber/Projetos/nlp/"
fname = r"blogset-br.csv.gz"
this_file = os.path.join(data_root, fname)
assert os.path.exists(this_file), this_file

column_names = ['postid', 'blogid', 'published', 'title', 'content', 'authorid',
                'author_displayName', 'replies_totalItems', 'tags']
parse_dates = ['published']
df_iterator = pd.read_csv(this_file,
                          skiprows=0,
                          compression='gzip',
                          chunksize=1000000,
                          header=None,
                          names=column_names,
                          parse_dates=parse_dates,
                          index_col=1)

count = 0
for df in df_iterator:
    filename = 'blogset-br-' + str(count) + '.csv'
    df.to_csv(filename)
    count += 1
This is the easiest way I could find.

Python 3: Opening multiple .csv files

I want to open multiple CSV files (with the same data types/columns), save the data into one variable, do some stuff to the data, and save it into one CSV file. While I can easily open one file, I can't seem to find a way to open multiple files. Here is my code:
import csv
from collections import Counter

files = ['11.csv', '12.csv', '13.csv', '14.csv', '15.csv']
with open(files) as csvfile:
    info = csv.reader(csvfile, delimiter=',')
    info_types = []
    records = 0
    for row in info:
        records = row[2]
        call_types.append(records)
stats = Counter(call_types).most_common()
print(stats)
results = stats
resultFile = open("Totals.csv", 'w')
wr = csv.writer(resultFile, dialect='excel')
for output in results:
    wr.writerow(output)
To make it work, and at the same time make it less bug-prone and more efficient, try the following:
import csv

files = ['11.csv', '12.csv', '13.csv', '14.csv', '15.csv']
with open("outfile", "wt", newline='') as fw:
    writer = csv.writer(fw)
    for file in files:
        with open(file) as csvfile:
            info = csv.reader(csvfile, delimiter=',')
            for row in info:
                # Process the row, but don't store it in any list
                # if you don't have to (that would defeat the purpose);
                # say you end up with processed_row
                writer.writerow(processed_row)
I would do this within a loop, since you are already appending the data as you read from each file:
for f in files:
    with open(f) as csvfile:
        ...
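Putting that loop together with the Counter logic from the question, a complete sketch might look like this (the column index 2 is carried over from the original code):

```python
import csv
from collections import Counter

def tally_column(files, col=2):
    """Count occurrences of the values in one column across many CSV files."""
    counts = Counter()
    for fname in files:
        with open(fname, newline='') as f:
            for row in csv.reader(f):
                counts[row[col]] += 1
    return counts

def write_totals(counts, outname="Totals.csv"):
    """Write (value, count) pairs, most common first."""
    with open(outname, 'w', newline='') as f:
        csv.writer(f, dialect='excel').writerows(counts.most_common())

# Usage:
# counts = tally_column(['11.csv', '12.csv', '13.csv', '14.csv', '15.csv'])
# write_totals(counts)
```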