Using Python's csv.DictReader to search for a specific key and print its value

BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file contains the fieldnames; however, my CSV dictionary file simply starts with "key","value" and goes on for at least 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for, and present the value (which is the 2nd column) on the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (input), my output will be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv
def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
    print list_of_stuff
main()

The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is the open file object, not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
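For a quick check in Python 3, here is a minimal self-contained sketch of the same idea. It assumes an in-memory io.StringIO in place of the real file, and load_mapping is a hypothetical helper name, not something from the question:

```python
import csv
import io

# Stand-in for anchor_summary2.csv; the rows are the sample data from the question.
SAMPLE = (
    '"Mamer","285713:13"\n'
    '"Champhol","461034:2"\n'
    '"Station Palais","972811:0"\n'
)


def load_mapping(file_obj):
    """Build a title -> value dict from a header-less two-column CSV."""
    # Passing fieldnames means the first row is treated as data, not as a header.
    reader = csv.DictReader(file_obj, fieldnames=("title", "value"))
    return {row["title"]: row["value"] for row in reader}


data = load_mapping(io.StringIO(SAMPLE, newline=""))
print(data.get("Station Palais", "not found"))
```

With a real file, the io.StringIO object would be replaced by the handle returned from open(..., newline='').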

The currently accepted answer is fine, but there's a slightly more direct way of getting at the data. The dict() constructor in Python can take any iterable of key/value pairs, and each two-column row produced by csv.reader is exactly such a pair.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))
print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
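To make the overwrite behaviour concrete, here is a small sketch with made-up rows (the duplicate "Mamer" entry is an assumption for illustration, not from the real file). It also shows one way to keep the first value instead, using dict.setdefault:

```python
import csv
import io

# Hypothetical data with a repeated key in the first column.
TEXT = '"Mamer","1"\n"Champhol","2"\n"Mamer","3"\n'

# dict() keeps the LAST value seen for a repeated key.
last_wins = dict(csv.reader(io.StringIO(TEXT, newline="")))

# setdefault() only stores a value when the key is not present yet,
# so the FIRST value seen for a repeated key is kept.
first_wins = {}
for key, value in csv.reader(io.StringIO(TEXT, newline="")):
    first_wins.setdefault(key, value)

print(last_wins["Mamer"], first_wins["Mamer"])
```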
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io
# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass
def main():
    # Case-insensitive & leading/trailing whitespace insensitive
    user_city = input('Enter a city: ').strip().lower()
    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            # The for loop finished without hitting break
            print("City not found.")
if __name__ == '__main__':
    main()
The advantage of this technique is that the CSV isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower() on both keys to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, the search returns almost immediately and stops reading the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
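As a rough sketch of the database route, assuming SQLite from the standard library is acceptable: the table name anchors, the helper lookup, and the in-memory sample are all made up for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for the real CSV file (sample rows from the question).
SAMPLE = (
    '"Mamer","285713:13"\n'
    '"Champhol","461034:2"\n'
    '"Station Palais","972811:0"\n'
)

# ":memory:" keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
# PRIMARY KEY gives the title column an index, so lookups do not scan every row.
conn.execute("CREATE TABLE anchors (title TEXT PRIMARY KEY, value TEXT)")
# INSERT OR REPLACE mirrors the dict behaviour: a later duplicate title wins.
conn.executemany(
    "INSERT OR REPLACE INTO anchors VALUES (?, ?)",
    csv.reader(io.StringIO(SAMPLE, newline="")),
)
conn.commit()


def lookup(connection, title):
    """Return the value stored for title, or None when it is missing."""
    row = connection.execute(
        "SELECT value FROM anchors WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None


print(lookup(conn, "Champhol"))
```

The import is a one-time cost; after that, each lookup goes through the index instead of reading the whole file.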


Python: Reading and Writing HUGE Json files

I am new to Python, so please excuse me if I am not asking the questions in a Pythonic way.
My requirements are as follows:
I need to write python code to implement this requirement.
I will be reading 60 JSON files as input. Each file is approximately 150 GB.
Sample structure for all 60 json files is as shown below. Please note each file will have only ONE json object. And the huge size of each file is because of the number and size of the "array_element" array contained in that one huge json object.
{
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"array_element":[]
}
Transformation logic is simple. I need to merge all the array_element from all 60 files and write it into one HUGE json file. That is almost 150GB X 60 will be the size of the output json file.
Questions for which I am requesting your help on:
For reading: I am planning to use the ijson module's ijson.items(file_object, "array_element"). Could you please tell me if ijson.items will yield one item at a time from the "array_element" array in the JSON file (that is, NOT load the entire file into memory)? I don't think json.load is an option here because we cannot hold such a huge dictionary in memory.
For writing: I am planning to read each item using ijson.items, encode it with json.dumps, and then write it to the file using file_object.write rather than json.dump, since I cannot hold such a huge dictionary in memory for json.dump. Could you please let me know if the f.flush() in the code shown below is needed? To my understanding, the internal buffer is flushed automatically when it is full, and its size is constant, so it won't grow to the point of exhausting memory. Please let me know if that is correct.
Is there any better approach than the ones mentioned above for incrementally reading and writing huge JSON files?
Code snippet showing above described reading and writing logic:
for input_file in input_files:
    with open("input_file.json", "r") as f:
        objects = ijson.items(f, "array_element")
        for item in objects:
            str = json.dumps(item, indent=2)
            with open("output.json", "a") as f:
                f.write(str)
                f.write(",\n")
                f.flush()
with open("output.json", "a") as f:
    f.seek(0,2)
    f.truncate(f.tell() - 1)
    f.write("]\n}")
Hope I have asked my questions clearly. Thanks in advance!!
The following program assumes that the input files have a format that is predictable enough to skip JSON parsing for the sake of performance.
My assumptions, inferred from your description, are:
All files have the same encoding.
All files have a single position somewhere at the start where "array_element":[ can be found, after which the "interesting portion" of the file begins
All files have a single position somewhere at the end where ]} marks the end of the "interesting portion"
All "interesting portions" can be joined with commas and still be valid JSON
When all of these points are true, concatenating a predefined header fragment, the respective file ranges, and a footer fragment would produce one large, valid JSON file.
import re
import mmap
head_pattern = re.compile(br'"array_element"\s*:\s*\[\s*', re.S)
tail_pattern = re.compile(br'\s*\]\s*\}\s*$', re.S)
input_files = ['sample1.json', 'sample2.json']
with open('result.json', "wb") as result:
    head_bytes = 500
    tail_bytes = 50
    chunk_bytes = 16 * 1024
    result.write(b'{"JSON": "fragment", "array_element": [\n')
    for input_file in input_files:
        print(input_file)
        with open(input_file, "r+b") as f:
            mm = mmap.mmap(f.fileno(), 0)
            start = head_pattern.search(mm[:head_bytes])
            end = tail_pattern.search(mm[-tail_bytes:])
            if not (start and end):
                print('unexpected file format')
                break
            start_pos = start.span()[1]
            end_pos = mm.size() - end.span()[1] + end.span()[0]
            if input_files.index(input_file) > 0:
                result.write(b',\n')
            pos = start_pos
            mm.seek(pos)
            while True:
                if pos + chunk_bytes >= end_pos:
                    result.write(mm.read(end_pos - pos))
                    break
                else:
                    result.write(mm.read(chunk_bytes))
                    pos += chunk_bytes
    result.write(b']\n}')
If the file format is 100% predictable, you can throw out the regular expressions and use mm[:head_bytes].index(b'...') etc for the start/end position arithmetic.
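A minimal sketch of that regex-free variant, operating on plain bytes for brevity. The exact marker strings (no whitespace between tokens) and the helper name interesting_span are assumptions; they would have to match the real files byte for byte:

```python
def interesting_span(data, head_marker=b'"array_element":[', tail_marker=b']}'):
    """Return (start, end) offsets of the array contents inside data.

    Assumes the markers appear exactly as given. index()/rindex() raise
    ValueError when a marker is missing, which plays the role of the
    'unexpected file format' check.
    """
    start = data.index(head_marker) + len(head_marker)  # first byte after '['
    end = data.rindex(tail_marker)                       # position of the final ']'
    return start, end


sample = b'{"string_1":"abc","array_element":[1,2,3]}'
start, end = interesting_span(sample)
print(sample[start:end])
```

For the huge files, the same arithmetic would be applied to the small head and tail slices of the mmap object rather than to the whole content, with the tail offset shifted by the slice's starting position.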

dumping list to JSON file creates list within a list [["x", "y","z"]], why?

I want to append multiple list items to a JSON file, but it creates a list within a list, and therefore I cannot access the list from Python. Since the code overwrites the existing data in the JSON file, there should not be any list there. I also tried putting just plain text in the file, without brackets. It still creates a list within a list: [["x", "y","z"]] instead of ["x", "y","z"].
import json
filename = 'vocabulary.json'
print("Reading %s" % filename)
try:
    with open(filename, "rt") as fp:
        data = json.load(fp)
    print("Data: %s" % data)#check
except IOError:
    print("Could not read file, starting from scratch")
    data = []
# Add some data
TEMPORARY_LIST = []
new_word = input("give new word: ")
TEMPORARY_LIST.append(new_word.split())
print(TEMPORARY_LIST)#check
data = TEMPORARY_LIST
print("Overwriting %s" % filename)
with open(filename, "wt") as fp:
    json.dump(data, fp)
example and output with appending list with split words:
Reading vocabulary.json
Data: [['my', 'dads', 'house', 'is', 'nice']]
give new word: but my house is nicer
[['but', 'my', 'house', 'is', 'nicer']]
Overwriting vocabulary.json
So, if I understand what you are trying to accomplish correctly, it looks like you are trying to overwrite a list in a JSON file with a new list created from user input. For easiest data manipulation, set up your JSON file in dictionary form:
{
"words": [
"my",
"dad's",
"house",
"is",
"nice"
]
}
You should then set up functions to separate your functionality to make it more manageable:
def load_json(filename):
    with open(filename, "r") as f:
        return json.load(f)
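The write_json() function used in the next snippet is not shown here. A minimal sketch of what it could look like, assuming it simply mirrors load_json() and overwrites the whole file:

```python
import json


def write_json(filename, data):
    """Overwrite filename with data serialized as indented JSON."""
    # Mode "w" truncates the file first, so the old content is replaced entirely.
    with open(filename, "w") as f:
        json.dump(data, f, indent=4)
```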
Now, we can use those functions to load the JSON, access the words list, and overwrite it with the new word.
data = load_json("vocabulary.json")
new_word = input("Give new word: ").split()
data["words"] = new_word
write_json("vocabulary.json", data)
If the user inputs "but my house is nicer", the JSON file will look like this:
{
"words": [
"but",
"my",
"house",
"is",
"nicer"
]
}
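As for the "why" in the question title: new_word.split() already returns a list, and list.append() adds that whole list as a single element, which produces the nesting. list.extend() (or plain assignment of the split result, as above) keeps the list flat. A small illustration:

```python
words = "but my house is nicer".split()  # split() already returns a list

nested = []
nested.append(words)  # the whole list becomes ONE element
flat = []
flat.extend(words)    # each word becomes its own element

print(nested)
print(flat)
```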
Edit
Okay, I have a few suggestions to make before I get into solving the issue. Firstly, it's great that you have delegated much of the program's functionality to separate functions. However, using global variables is generally discouraged because it makes things extremely difficult to debug: any of the functions that use the variable could have mutated it by accident. To fix this, use function parameters and pass the data around accordingly. With small programs like this, you can think of main() as the point through which all data flows. This means that main() will pass data to other functions and receive new or edited data back. One final recommendation: you should only use all capital letters for variable names if they are constants. For example, PI = 3.14159 is a constant, so it is conventional to write "pi" in all caps.
Without using global, main() will look much cleaner:
def main():
    choice = input("Do you want to start or manage the list? (start/manage)")
    if choice == "start":
        # load_json() takes the filename as its argument
        data = load_json("vocabulary.json")
        words = data["words"]
        dictee(words)
    elif choice == "manage":
        manage_list()
You can use the load_json() function from earlier (notice that I deleted write_json(), more on that later) if the user chooses to start the game. If the user chooses to manage the file, we can write something like this:
def manage_list():
    choice = input("Do you want to add or clear the list? (add/clear)")
    if choice == "add":
        words_to_add = get_new_words()
        add_words("vocabulary.json", words_to_add)
    elif choice == "clear":
        clear_words("vocabulary.json")
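get_new_words() is not defined in this answer. A minimal sketch of what it might look like, assuming the words are entered on one line separated by spaces (the prompt text is taken from the example session further down):

```python
def get_new_words():
    """Ask for words separated by spaces and return them as a flat list."""
    raw = input("Please enter the words you want to add, separated by spaces: ")
    # split() with no argument also ignores repeated or trailing spaces
    return raw.split()
```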
We get the user input first and then we can call two other functions, add_words() and clear_words():
def add_words(filename, words):
    with open(filename, "r+") as f:
        data = json.load(f)
        data["words"].extend(words)
        f.seek(0)
        json.dump(data, f, indent=4)
        f.truncate()  # drop any leftover bytes if the new JSON is shorter
def clear_words(filename):
    with open(filename, "w+") as f:
        data = {"words":[]}
        json.dump(data, f, indent=4)
I did not use the load_json() function in the two functions above. My reasoning is that it would open the file more times than needed, which would hurt performance. Furthermore, these two functions already need to open the file, so it is okay to load the JSON data here because it takes only one line: data = json.load(f). You may also notice that in add_words(), the file mode is "r+". This is the basic mode for reading and writing. "w+" is used in clear_words() because "w+" not only opens the file for reading and writing, it also truncates the file if it exists (that is also why we don't need to load the JSON data in clear_words()). Because we have these two functions for writing and/or overwriting data, we don't need the write_json() function that I had initially suggested.
We can then add to the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)add
>>> Please enter the words you want to add, separated by spaces: these are new words
And the JSON file becomes:
{
"words": [
"but",
"my",
"house",
"is",
"nicer",
"these",
"are",
"new",
"words"
]
}
We can then clear the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)clear
And the JSON file becomes:
{
"words": []
}
Great! Now, we implemented the ability for the user to manage the list. Let's move on to creating the functionality for the game: dictee()
You mentioned that you want to randomly select an item from a list and remove it from that list so it doesn't get asked twice. There are a multitude of ways you can accomplish this. For example, you could use random.shuffle:
import random
def dictee(words):
    correct = 0
    incorrect = 0
    random.shuffle(words)  # shuffles the list in place
    for word in words:
        # ask word
        # evaluate response
        # increment correct/incorrect
        # ask if you want to play again
        pass
random.shuffle shuffles the list in place. Then, you can iterate through the list using for word in words: and start the game. You don't necessarily need random.choice here because shuffling the list and then iterating through it already visits the words in random order, each exactly once.
I hope this helped illustrate how powerful functions and function parameters are. They not only help you separate your code, but also make it easier to manage, understand, and write cleaner code.

How to add/change column names with pyarrow.read_csv?

I am currently trying to import a big CSV file (50GB+) without any headers into a pyarrow table, with the overall goal of exporting it to the Parquet format and then processing it in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes within pyarrow for the CSV file?
I already thought about prepending a header to the CSV file, but that forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example of doing so while converting a CSV file to an Arrow table.
As a simple example, imagine that this CSV file has just the two columns "A" and "B".
My current code looks like this:
import numpy as np
import pandas as pd
import pyarrow as pa
df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)
df_with_header.to_csv("data.csv", header=False, index=False)
df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)
opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
'B': 'int8'})
table = pa.csv.read_csv(input_file = "data.csv", convert_options = opts)
print(table)
If I print out the final table, the column names have not changed:
pyarrow.Table
1: int64
3: int64
How can I now change the loaded column names and dtypes? Is it also possible to pass in, for example, a dict containing the names and their dtypes?
You can specify type overrides for columns:
import io
import pyarrow as pa
from pyarrow import csv
fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    convert_options=csv.ConvertOptions(
        column_types={
            'one': pa.int8(),
            'two': pa.int8(),
            'three': pa.int8(),
        }
    ))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
    fp,
    parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to the question "apache arrow - reading csv file".
There should be a fix for it in the next version: https://github.com/apache/arrow/pull/4898

How to control quoting on non-numerical entries in a csv file?

I am using Python3's csv module and am wondering why I cannot control quoting correctly. I am using the option quoting = csv.QUOTE_NONNUMERIC but am still seeing all entries quoted. Any idea as to why that is?
Here's my code. Essentially, I am reading in a csv file and want to remove all duplicate lines that have the same text string:
import sys
import csv
class Row:
    def __init__(self, row):
        self.text, self.a, self.b = row
        self.elements = row
with open(sys.argv[2], 'w', newline='') as output:
    writer = csv.writer(output, delimiter=';', quotechar='"',
                        quoting=csv.QUOTE_NONNUMERIC)
    with open(sys.argv[1]) as input:
        reader = csv.reader(input, delimiter=';')
        header = next(reader)
        Row.labels = header
        assert Row.labels[1] == 'Label1'
        writer.writerow(header)
        texts = set()
        for row in reader:
            row_object = Row(row)
            if row_object.text not in texts:
                writer.writerow(row_object.elements)
                texts.add(row_object.text)
When I look at the generated file, the content looks like this:
"Label1";"Label2";"Label3"
"AAA";"123";"456"
...
But I want this:
"Label1";"Label2";"Label3"
"AAA";123;456
...
OK ... I figured it out myself. The answer, I am afraid, was rather simple, and obvious in retrospect. Since the content of each line is obtained from a csv.reader(), its elements are strings by default. As a result, they get quoted by the subsequently employed csv.writer().
To be written without quotes, the numeric fields first need to be converted to int:
row_object.elements[1]= int(row_object.a)
This explanation can be proven by inserting a type check before and after this cast:
print('Type: {}'.format(type(row_object.elements[1])))
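A self-contained sketch of the effect, writing to an in-memory buffer instead of a file. The helper to_number is a made-up name, and it only handles integers, which is an assumption about the data:

```python
import csv
import io


def to_number(field):
    """Return int(field) when the text is an integer, else the text unchanged."""
    try:
        return int(field)
    except ValueError:
        return field


out = io.StringIO(newline="")
writer = csv.writer(out, delimiter=';', quotechar='"',
                    quoting=csv.QUOTE_NONNUMERIC)

row = ["AAA", "123", "456"]                   # what csv.reader yields: all strings
writer.writerow(row)                          # every field is quoted
writer.writerow([to_number(f) for f in row])  # ints are written unquoted

print(out.getvalue())
```

The first output line has all three fields quoted; in the second, only the text field is quoted.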

Reading CSV file and generating Dictionaries

I have a CSV file that looks like this:
Hit39, Hit24, Hit9
Hit8, Hit39, Hit21
Hit46, Hit47, Hit20
Hit24, Hit 53, Hit46
I want to read the file and create a dictionary on a first-come, first-served basis,
like Hit39: 1, Hit24: 2, and so on ...
But notice that Hit39 appears again in column 2 of row 2. If the reader encounters it there, it should not add it to the dictionary again; it should move on with the next number.
Once an entry has been visited, it shouldn't be numbered again if it appears later.
Using Python (best guess until the OP is clarified): treat the file as though it were one huge list and assign an incrementing number to each unique value as it first occurs.
import csv
from itertools import count
mydict = {}
counter = count(1)
with open('infile.csv') as fin:
    for row in csv.reader(fin, skipinitialspace=True):
        for col in row:
            mydict[col] = mydict.get(col, next(counter))
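One caveat: in mydict.get(col, next(counter)), the default argument is evaluated on every call, so the counter also advances for values that are already in the dictionary and the numbering ends up with gaps. If consecutive numbers are wanted (which is only a guess at the intent), a membership test avoids that. A sketch on the sample data, with an in-memory buffer standing in for the file and number_first_seen as a made-up helper name:

```python
import csv
import io

# The sample rows from the question, used in place of infile.csv.
SAMPLE = (
    "Hit39, Hit24, Hit9\n"
    "Hit8, Hit39, Hit21\n"
    "Hit46, Hit47, Hit20\n"
    "Hit24, Hit 53, Hit46\n"
)


def number_first_seen(file_obj):
    """Map each value to the order (1, 2, 3, ...) in which it first appears."""
    numbers = {}
    for row in csv.reader(file_obj, skipinitialspace=True):
        for col in row:
            if col not in numbers:
                # len(numbers) grows only when a new value is stored, so no gaps
                numbers[col] = len(numbers) + 1
    return numbers


result = number_first_seen(io.StringIO(SAMPLE, newline=""))
print(result)
```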
Since Python is a popular language that has dictionaries, you must be using Python. At least I assume.
import csv
reader = csv.reader(file("filename.csv"))
d = dict((line[0], 1+lineno) for lineno, line in enumerate(reader))
print d