CSV to dictionary - Python - csv

For homework we have been asked to build a dictionary from a CSV file.
The CSV looks something like this:
David,5,6,10,12,15,20
Micheal,9,15,13,20,5,8
John,1,2,5,8,19,10
I want to convert the CSV to a Python dictionary,
but I don't know how to do that. This is what I have so far:
import csv
from statistics import mean
with open('grades.csv') as FileCsv:
    reader = csv.reader(FileCsv)
    for index in reader:
        name = index[0]
        these_grades = list()
        for lines in index[1:]:
            these_grades.append(int(lines))
        print(mean(these_grades))
# Example
# average = dict()
# print(average['John'])
The output should look like this:
John's Mean = 7.5

Instead of creating a list (these_grades = list()), create the list in a dictionary indexed by name:
students = dict()
...
students[name] = list()
Then append grades to the list:
students[name].append(grade)
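Putting the two pieces together, a minimal sketch might look like the following. It assumes the rows shown in the question; an in-memory `io.StringIO` stands in for `grades.csv` so the snippet runs on its own, and the `averages` dictionary is a name introduced here for illustration.

```python
import csv
import io
from statistics import mean

# Stand-in for open('grades.csv'); same rows as in the question.
csv_text = """David,5,6,10,12,15,20
Micheal,9,15,13,20,5,8
John,1,2,5,8,19,10
"""

students = dict()
with io.StringIO(csv_text) as file_csv:
    for row in csv.reader(file_csv):
        name = row[0]
        students[name] = list()
        for grade in row[1:]:
            students[name].append(int(grade))

# Map each name to the mean of that student's grades.
averages = {name: mean(grades) for name, grades in students.items()}
print(f"John's Mean = {averages['John']}")
```

Keeping the raw grades in `students` and the means in a second dictionary leaves both available for later lookups by name.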

Related

How can I save some json files generated in a for loop as csv?

Sorry, I am new to coding in Python. I need to save the JSON generated in each iteration of a for loop as its own CSV file.
My code works fine for the first CSV file, but the file is then overwritten on every later iteration, and I have not found a solution yet. Can anyone help me? Many thanks.
from twarc.client2 import Twarc2
import itertools
import pandas as pd
import csv
import json
import numpy as np
# Your bearer token here
t = Twarc2(bearer_token="AAAAAAAAAAAAAAAAAAAAA....WTW")
# Get a bunch of user handles you want to check:
list_of_names = np.loadtxt("usernames.txt",dtype="str")
# Get the `data` part of every request only, as one list
def get_data(results):
    return list(itertools.chain(*[result['data'] for result in results]))
user_objects = get_data(t.user_lookup(users=list_of_names, usernames=True))
for user in user_objects:
    following = get_data(t.following(user['id']))
    # Do something with the lists
    print(f"User: {user['username']} Follows {len(following)} -2")
    json_string = json.dumps(following)
    df = pd.read_json(json_string)
    df.to_csv('output_file.csv')
You need to add a sequence number or some other unique identifier to the filename. The simplest options are to keep a counter or to use a GUID. Below I've used a counter that is initialized before your loop and incremented in each iteration. Because it starts at 0, this produces files named output_file_0.csv, output_file_1.csv, output_file_2.csv and so on.
counter = 0
for user in user_objects:
    following = get_data(t.following(user['id']))
    # Do something with the lists
    print(f"User: {user['username']} Follows {len(following)} -2")
    json_string = json.dumps(following)
    df = pd.read_json(json_string)
    df.to_csv('output_file_' + str(counter) + '.csv')
    counter += 1
We convert the integer to a string and insert it between the name of your file and its extension.
from twarc.client2 import Twarc2
import itertools
import pandas as pd
import csv
import json
import numpy as np
# Your bearer token here
t = Twarc2(bearer_token="AAAAAAAAAAAAAAAAAAAAA....WTW")
# Get a bunch of user handles you want to check:
list_of_names = np.loadtxt("usernames.txt",dtype="str")
# Get the `data` part of every request only, as one list
def get_data(results):
    return list(itertools.chain(*[result['data'] for result in results]))
user_objects = get_data(t.user_lookup(users=list_of_names, usernames=True))
for idx, user in enumerate(user_objects):
    following = get_data(t.following(user['id']))
    # Do something with the lists
    print(f"User: {user['username']} Follows {len(following)} -2")
    json_string = json.dumps(following)
    df = pd.read_json(json_string)
    # idx is unique per iteration, so each user gets a separate file
    df.to_csv(f'output_file{idx}.csv')
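Another option for the unique part of the name is a value taken from the data itself. The sketch below is a stdlib-only illustration and does not call twarc: `save_rows` and the sample `users` list are hypothetical, and it puts each record's `username` in the file name on the assumption that usernames are unique and safe to use in file names.

```python
import csv
import tempfile
from pathlib import Path

def save_rows(rows, path):
    """Write a list of flat dicts to `path` as CSV (hypothetical helper)."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Made-up stand-in for the per-user results fetched in the loop.
users = [
    {"username": "alice", "following": [{"id": "1", "name": "A"}]},
    {"username": "bob", "following": [{"id": "2", "name": "B"}, {"id": "3", "name": "C"}]},
]

out_dir = Path(tempfile.mkdtemp())
written = []
for user in users:
    # The file name changes every iteration, so nothing is overwritten.
    path = out_dir / f"output_file_{user['username']}.csv"
    save_rows(user["following"], path)
    written.append(path.name)

print(written)
```

A name derived from the data makes it easier to tell which file belongs to which user than a bare counter does.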

How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to calculate numbers based on the data in them (for example, count speakers by gender). I am currently using an S3 paginator and json.loads to read each file and extract the information, but processing such a large number of files takes a very long time (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
import json
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name',StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page[ "Contents" ]:
            keyString = key[ "Key" ]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = (json_content['dict-name'])
To use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second covers only the read or includes part of the number crunching; either way, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
For this to be useful for me, I run it on spot instances that have lots of vCPUs and memory. I've found that network-optimized instances (like c5n; look for the n) and inf1 instances (for machine learning) are much faster at reading/writing than, for example, T or M instance types.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. Multiprocessing is orders of magnitude faster than reading with a single process.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *
ufh = file_utilities() # instantiate the class functions - see below (second file)
bucket = 'your-bucket'
prefix = 'your-prefix/here/' # if you don't have a prefix, pass '' (an empty string), or the function will fail
# define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        # each task gets (bucket, key); chunksize=15 hands keys to workers in batches
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), chunksize=15)
        pool.close()
        pool.join()
    return df_list
# the __main__ guard is required where worker processes are spawned (e.g. Windows, macOS)
if __name__ == '__main__':
    # create your master keys list upfront; you can loop through all or slice the list to test
    keys_list = ufh.get_keys_from_prefix(bucket, prefix)
    # keys_list = keys_list[0:2000] # as an example
    num_proc = os.cpu_count() # tells you how many processors your machine has; function above defaults to 4 unless given
    df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc) # collect dataframes for each file
    df_new = pd.concat(df_list, sort=False)
    df_new = df_new.reset_index(drop=True)
    # do your analysis on the dataframe
File 2: class functions
#utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd
#define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')
class file_utilities:
    """file handling function"""
    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            # 'Contents' is absent when a page has no objects, so default to an empty list
            keys = [content['Key'] for content in page.get('Contents', [])]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list
    def read_json_file_from_s3(self, bucket, key):
        """read json file"""
        obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data
    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
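If the per-file work is mostly waiting on the network, a thread pool is another way to overlap the reads. This swaps threads in for the processes used above and is only a sketch: `load_json` is a hypothetical stand-in for the S3 read (in practice it would wrap something like `read_json_file_from_s3`), the in-memory `FAKE_BUCKET` and the `gender` field are made up, and the worker count is arbitrary.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up stand-in for the bucket: key -> JSON text.
FAKE_BUCKET = {
    "a.json": json.dumps({"dict-name": {"gender": "female"}}),
    "b.json": json.dumps({"dict-name": {"gender": "male"}}),
    "c.json": json.dumps({"dict-name": {"gender": "female"}}),
}

def load_json(key):
    """Hypothetical loader; a real one would fetch the object body from S3."""
    return json.loads(FAKE_BUCKET[key])

def gender_of(key):
    """Extract only the field needed, so little data is kept per file."""
    return load_json(key)["dict-name"]["gender"]

def count_genders(keys, max_workers=8):
    # executor.map runs gender_of concurrently and yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return Counter(executor.map(gender_of, keys))

counts = count_genders(sorted(FAKE_BUCKET))
print(counts)
```

Returning one small value per file and aggregating with a `Counter` avoids building a dataframe for every object when only a tally is needed.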

Convert multiple csv files to json using python

I am trying to convert the CSV files in a folder to a single JSON file. The code below does the job, but the JSON file has the first CSV written several times. I guess I am going wrong with assigning the data variable. Please help me fix it.
import csv, json, os
dir_path = 'C:/Users/USER/Desktop/output_files'
inputfiles = [file for file in os.listdir(dir_path) if file.endswith('.csv')]
outputfile = "data_backup1.json"
for file in inputfiles:
    filepath = os.path.join(dir_path, file)
    data = {}
    with open(filepath, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row['ID']
            data[id] = row
            with open(outputfile, "a") as jsonfile:
                jsonfile.write(json.dumps(data, indent=4))
Expected output: the JSON file should have each CSV written into it only once.
If your .csv files and all of their rows have different ['ID'] values, your dictionary keys will be unique. In that case, your dictionary grows by one entry per .csv row that the reader yields.
You have to change the indentation of the jsonfile.write() call as shown below to produce just one .json file. To sort your entries, you could add sort_keys=True to json.dumps().
for file in inputfiles:
    filepath = os.path.join(dir_path, file)
    data = {}
    with open(filepath, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row['ID']
            data[id] = row
    with open(outputfile, "a") as jsonfile:
        jsonfile.write(json.dumps(data, indent=4, sort_keys=True))
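Note that appending one json.dumps() result per file leaves several top-level objects in the output, which json.load() cannot read back as one document. If a single valid JSON document is wanted, one possible approach is to merge all rows first and write once. The sketch below assumes, as above, that every CSV has a unique ID column; the temporary folder, the sample files, and `merge_csvs_to_json` are hypothetical.

```python
import csv
import json
import os
import tempfile

def merge_csvs_to_json(dir_path, output_path):
    """Collect rows from every .csv in dir_path, keyed by ID, and write them once."""
    data = {}  # created once, outside the loop, so rows from all files accumulate
    for name in sorted(os.listdir(dir_path)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(dir_path, name), newline="") as csvfile:
            for row in csv.DictReader(csvfile):
                data[row["ID"]] = row
    with open(output_path, "w") as jsonfile:  # "w": a single write, no appending
        json.dump(data, jsonfile, indent=4, sort_keys=True)
    return data

# Made-up sample files so the sketch runs on its own.
folder = tempfile.mkdtemp()
for filename, text in [("a.csv", "ID,name\n1,x\n2,y\n"), ("b.csv", "ID,name\n3,z\n")]:
    with open(os.path.join(folder, filename), "w", newline="") as handle:
        handle.write(text)

output_file = os.path.join(folder, "data_backup1.json")
merge_csvs_to_json(folder, output_file)
with open(output_file) as handle:
    merged = json.load(handle)
print(sorted(merged))
```

If two files share an ID, the later file's row replaces the earlier one, so this only fits data where IDs are unique across files.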

How to iterate through a python dictionary from user input?

Background:
I have a JSON dictionary file as follows:
dictionary = {"Qui": "クイ", "Quiana": "キアナ", "Quick": "クイック", "Quickley": "クイックリー", "Quico": "キコ", "Quiej-Alvarez": "クエイ アルバレス", "Quigg": "クイッグ", "Quigley": "クイグリー", "Quijano": "クイジャーノ", "Quik": "クイック", "Quilici": "クイリチ", "Quill": "クィル"}
Then I let the user enter as many keys as they want through input, and finally return a formatted string that combines the matching values.
Question:
My code so far gets the job done in a very clunky/incomplete manner. Any advice on how to clean up the code and achieve my goal?
Current code:
import json
import sys, math
import codecs
#Part1
search_term,search_term2 = input("Enter a Name: ").split()
dictionary = {}
keys = dictionary.keys()
values = dictionary.values()
with open ('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data = json.load(f)
if search_term.casefold() in term_data:
    word = search_term.title()
elif search_term.title() in term_data:
    word = search_term.title()
output1 = "{}".format(term_data[search_term])
#Part 2
with open ('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data2 = json.load(f)
if search_term2.casefold() in term_data2:
    word2 = search_term2.title()
elif search_term2.title() in term_data2:
    word2 = search_term2.title()
#else:
    #print("Name not found in dictionary.")
output2 = "{}".format(term_data2[search_term2])
print("{}・{}".format(output1,output2))
Your current code only accepts 2 keys, which does not meet your original requirement. I expanded it as follows and made it simpler:
test.py:
import json
import codecs
with open('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data = json.load(f)
search_terms = input("Enter a name: ").split()
l = [term_data[i] for i in search_terms if i.casefold() in term_data or i.title() in term_data]
print('.'.join(l))
First, we only need to open the JSON file once. I/O operations are expensive, so we should avoid repeating them.
Second, we don't need to repeat the term matching as you do in Part 1 and Part 2. We can do it in a loop; here I use a list comprehension.
Finally, a little explanation:
Split all the user input into a list: search_terms.
Loop over the entered terms with for i in search_terms.
If the casefold() or title() form of the candidate term i is in the dictionary term_data, its value is added to the new list l; if not, nothing happens.
At last, use the separator . to join all the elements of the list.
Output:
~$ python3 test.py
Enter a name: Qui Quill Quiana
クイ.クィル.キアナ
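One caveat with the comprehension above: the membership test checks i.casefold() and i.title(), but the lookup uses term_data[i], so an input such as qui passes the test and then raises KeyError. A possible variant looks up whichever form actually matched. This is only a sketch: `translate` is a hypothetical helper, and a few entries from the question's dictionary are inlined in place of translation.json.

```python
# A few entries from the question's dictionary, inlined instead of translation.json.
term_data = {"Qui": "クイ", "Quiana": "キアナ", "Quill": "クィル"}

def translate(raw_input_text, table, separator="・"):
    """Return the joined translations of every whitespace-separated term found in table."""
    found = []
    for term in raw_input_text.split():
        # Try the term as typed, then its casefold() and title() forms.
        for candidate in (term, term.casefold(), term.title()):
            if candidate in table:
                found.append(table[candidate])
                break  # stop at the first form that matches
    return separator.join(found)

result = translate("qui QUILL Quiana unknown", term_data)
print(result)
```

Terms that match no form are skipped silently here; reporting them to the user would be a small extension.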

How do I grab info from this json file?

I'm trying to grab some numbers from this JSON file, but I don't know how to do it correctly. This is the JSON file I am trying to gather information from:
http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
I've been trying to get this code to work, but I can't figure it out:
import json
from pprint import pprint
with open('data.json') as data_file:
    data = json.load(data_file)
data["rowSet"] ["1610612737"] ["Atlanta Hawks"]
I'm trying to get the statistics from each team.
The following Python script should do it.
#!/usr/bin/env python3
import json
with open('leaguedashteamstats.json') as data_file:
    data = json.load(data_file)
# extract headers names
headers = data['resultSets'][0]['headers']
# extract raw json rows
raw_rows = data['resultSets'][0]['rowSet']
for row in raw_rows:
    print(row[1])  # prints team name
    # pairs header names with values and prints them out
    for (header, value) in zip(headers, row):
        print(header, value)
    print('\n')
Both data and code can be seen here:
https://gist.github.com/cevaris/24d0b7d97677667aedb14059a6959da1#file-1-team-stats-output
Disclaimer: this code doesn't contain any validation, but it should lead you in the right direction:
import json
with open('data.json') as data_file:
    data = json.load(data_file)
for rs in data.get('resultSets'):
    for r_ in [r for r in rs.get('rowSet') if r[1] == 'Atlanta Hawks']:
        print(r_)
You basically need to determine specific keys that you are going to loop through, or obtain.
This should hopefully get you to where you need to be.
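To look a team up by name instead of scanning the rows each time, one option is to pair each row with the headers and index the result by team name. The sketch below uses a tiny hand-made payload in the same shape as the response (resultSets containing headers and rowSet); the header names and all numbers are placeholders, not real statistics, and `stats_by_team` is a name introduced here.

```python
import json

# Hand-made sample in the same shape as the response; values are placeholders.
sample = json.loads("""
{
  "resultSets": [
    {
      "headers": ["TEAM_ID", "TEAM_NAME", "GP", "W"],
      "rowSet": [
        [1, "Atlanta Hawks", 10, 6],
        [2, "Boston Celtics", 10, 7]
      ]
    }
  ]
}
""")

result_set = sample["resultSets"][0]
headers = result_set["headers"]

# Turn each positional row into a dict, then key the dicts by team name (row[1]).
stats_by_team = {row[1]: dict(zip(headers, row)) for row in result_set["rowSet"]}

print(stats_by_team["Atlanta Hawks"]["W"])
```

With the real file, the same comprehension would apply after loading it with json.load(), provided the team name is still the second column.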