Python csv writer: numpy array to csv

I have a Python dict containing 4 key-value pairs. Each value is a numpy array. I want to write the whole dict to a CSV file, with one numpy array per row.
with open(os.path.join("csv", title), 'w', newline='') as f:
    w = csv.DictWriter(f, list(data.keys()))
    w.writeheader()
    w.writerow(data)
This is what I have tried so far, but some of my arrays get written across several rows instead of a single line.
Here an example of input data:
{'DE': array([[ 38574. , 38538.1904, 39511.6190, 42521.1428,
50586. , 46282.5238, 42714.4761, 40612.0476],
[ 42798.4666, 42112.5333, 42277.8666, 42886.1333,
50224.3333, 48148.8 , 44272.6666, 41210.2 ]])}
I expect each row of my array to be written on a single line. Instead I get a file containing "\n" after a certain number of digits. How can I force the whole array onto one row?

DE has a multidimensional array as its value and Inter has an empty list as its value, so you end up with two columns: one headed Inter with an empty list in its column, and a second headed DE with the array in its column, which is exactly what the code should be doing.
If you want to control how many characters numpy prints per line, try setting numpy.set_printoptions:
numpy.set_printoptions(linewidth=1000)
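Alternatively, if the goal is really one array row per CSV line, it may be simpler to skip the array's string repr entirely and hand each row to csv.writer. A sketch of that idea (plain lists stand in for the numpy rows; a real array would be converted first with arr.tolist()):

```python
import csv
import io

# Stand-in for one dict value: with numpy, use data['DE'].tolist() here.
rows = [[38574.0, 38538.1904, 39511.6190, 42521.1428],
        [42798.4666, 42112.5333, 42277.8666, 42886.1333]]

buf = io.StringIO()  # or an open file in a real script
w = csv.writer(buf)
for row in rows:     # one array row -> one CSV line, no line wrapping
    w.writerow(row)

print(buf.getvalue())
```

This sidesteps print-width settings entirely, because csv.writer formats each value individually rather than relying on numpy's string formatting.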

Related

How to omit the header when using Spark to read a CSV file?

I am trying to use Spark to read a CSV file in a Jupyter notebook. So far I have:
spark = SparkSession.builder.master("local[4]").getOrCreate()
reviews_df = spark.read.option("header","true").csv("small.csv")
reviews_df.collect()
This is what reviews_df looks like:
[Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5'),
Row(reviewerID=u'A2YB0B3QOHEFR', asin=u'B000JJSRNY', overall=u'5'),
Row(reviewerID=u'AAI0092FR8V1W', asin=u'B0060MYKYY', overall=u'5'),
Row(reviewerID=u'A2TAPSNKK9AFSQ', asin=u'6303187218', overall=u'5'),
Row(reviewerID=u'A316JR2TQLQT5F', asin=u'6305364206', overall=u'5')...]
But each row of the data frame contains the column names. How can I reformat the data so that it becomes:
[(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'),
(u'A2YB0B3QOHEFR', u'B000JJSRNY', u'5')....]
A DataFrame always returns Row objects; that's why when you call collect() on a DataFrame, it shows:
Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5')
To get what you want, you can do:
reviews_df.rdd.map(lambda row : (row.reviewerID,row.asin,row.overall)).collect()
This will return a list of tuples of the row values.
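Since pyspark's Row is a tuple subclass, an arguably simpler variant is reviews_df.rdd.map(tuple).collect(), which drops the field names without spelling each one out. Illustrated below with collections.namedtuple as a stand-in for Row (a sketch; not run against an actual Spark session):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which also subclasses tuple.
Row = namedtuple('Row', ['reviewerID', 'asin', 'overall'])

rows = [Row('A1YKOIHKQHB58W', 'B0001VL0K2', '5'),
        Row('A2YB0B3QOHEFR', 'B000JJSRNY', '5')]

# tuple(r) is what .rdd.map(tuple) applies to each Row.
tuples = [tuple(r) for r in rows]
print(tuples)  # field names dropped, values kept in order
```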

Removing characters from column in pandas data frame

My goal is to (1) import Twitter JSON, (2) extract the data of interest, and (3) create a pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd

tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except ValueError:
        continue
# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})
# Create a data frame satisfying conditions:
df2 = df[(df['Lang']==('en')) & (df['Geo'].dropna())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinate values inside the square brackets, I tried:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object.
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, you should just use the 'coordinates' key to return those values, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
See the Python documentation on data structures for more details on the data structures discussed above.
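As a self-contained illustration of the key-access approach (plain dicts stand in for the df2['Geo'] column here, since the original JSON isn't available):

```python
# Each entry mimics one value in the Geo column: a dict, not a string.
geo_column = [
    {'coordinates': [39.11890951, -84.48903638], 'type': 'Point'},
    {'coordinates': [40.71277770, -74.00597740], 'type': 'Point'},
]

# Key access instead of string parsing:
first_coords = [g['coordinates'][0] for g in geo_column]
print(first_coords)  # [39.11890951, 40.7127777]
```

The same comprehension works unchanged when geo_column is a pandas Series of dicts, which is exactly what df2['Geo'] holds.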

Python "AttributeError: 'NotImplementedType' object has no attribute" when dividing

I've tried to thoroughly research this question before asking it. I'm trying to plot the ratio of two lists that are contained in a dictionary.
line_ids = ['blah1', 'blah2', 'blah3', 'blah4']
elines = {}
for i in range(len(line_ids)):
    data = []
    with open('../output/' + line_ids[i] + '.csv', 'rb') as f:
        csvReader = csv.reader(f, delimiter='\t')
        for row in csvReader:
            data.append(row)
    elines[line_ids[i]] = asarray(data)
Printing elines['blah1'] from the dictionary gives:
[['4.6976281459143071e-40' '3.0049306872382702e-39'
'1.9820026838968144e-38' '1.6041105541709449e-37']
['1.542746402089586e-35' '9.8686046391594954e-35' '6.5092653777796069e-34'
'5.2672534967984846e-33']
['5.1441760072407447e-31' '3.2907875381847918e-30'
'2.1708144830971927e-29' '1.7560195950953601e-28']
['1.7569718535756951e-26' '1.1242245080095899e-25'
'7.4206530085692796e-25' '6.0042313952458629e-24']
['6.2845797115752487e-22' '4.0257542124265526e-21' '2.66528748586604e-20'
'2.1666107897620966e-19']
['2.5547831152324016e-17' '1.6442300355147718e-16'
'1.1022166700730511e-15' '9.1504695119123154e-15']
['1.5754213462395474e-12' '1.0263716591948211e-11'
'7.0896658599931989e-11' '6.1192748118049791e-10']
['2.1154710788925884e-07' '1.3897595085341154e-06'
'9.7645963829243462e-06' '8.3998195937762357e-05']
['0.048187475948250416' '0.31578185949368143' '2.1989098794898618'
'18.120232380010545']
['13029.442003642062' '84972.769876238017' '583770.26053237868'
'4613639.5426874915']
['3726334731.7746887' '24202150828.792419' '164441556532.18036'
'1258809063091.2998']
['1095752351035507.6' '7094645944427608.0' '47806778370222816.0'
'3.5753379508453267e+17']
['3.2816291840091796e+20' '2.1198307401280088e+21'
'1.4197379061068677e+22' '1.0439706837407766e+23']
['9.9600859087036886e+25' '6.4228680979461599e+26'
'4.2823746950774039e+27' '3.1104137359015335e+28']
['3.0534668934520022e+31' '1.9665558653862894e+32'
'1.3068413234720059e+33' '9.406936707924414e+33']
['9.4324018968341618e+36' '6.0691075771818466e+37'
'4.0232499374256741e+38' '2.8769493880535716e+39']]
When I try to divide two of the arrays, I get the following when running the script:
print divide(elines['blah1'][0],elines['blah2'][0])
NotImplemented
I thought it might have to do with the numbers being treated as strings within the array, so I tried converting them with the float() function, but I get an error saying only length-1 arrays can be converted to Python scalars. Ideally, I'd like to plot column 1 of blah1 vs. column 1 of blah2, column 2 of blah1 vs. column 2 of blah2, etc.
Any help would be greatly appreciated. Thanks!
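For reference, a likely fix (my reading of the symptoms, not part of the original thread): csv.reader yields strings, so asarray builds a string-dtype array, and NumPy returns NotImplemented when asked to divide string arrays. Calling float() fails because it is handed whole rows, not single values; converting the entire array with astype(float) avoids both problems:

```python
import numpy as np

# String arrays, as produced by csv.reader + asarray in the question.
a = np.asarray([['4.0', '8.0'], ['16.0', '32.0']])
b = np.asarray([['2.0', '2.0'], ['4.0', '4.0']])

ratio = a.astype(float) / b.astype(float)  # element-wise division now works
print(ratio[:, 0])  # column 1 of a divided by column 1 of b
```

With the real data, ratio[:, j] would then give the column-by-column ratios to plot.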

Using Python's csv.dictreader to search for specific key to then print its value

BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file holds the fieldnames; however, my csv dictionary file simply starts with "key","value" pairs and goes on for at least 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for, and present the value (which is the 2nd column) to the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (the input), my output will be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv

def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
        print list_of_stuff

main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is the text of the file, not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data: the dict() constructor in Python can take any iterable of key/value pairs.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))

print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io

# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass

def main():
    # Case-insensitive & leading/trailing-whitespace-insensitive match
    user_city = input('Enter a city: ').strip().lower()
    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            print("City not found.")

if __name__ == '__main__':
    main()
The advantage of this technique is that the CSV isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower() on both strings to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, it returns almost immediately and stops looking through the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
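For example, with the standard library's sqlite3 module (a sketch; the table name and in-memory database are my own choices, and the sample rows stand in for anchor_summary2.csv), a primary-key lookup avoids rescanning the file on every query:

```python
import csv
import io
import sqlite3

# Sample rows standing in for the 500,000-line CSV file.
sample = '"Mamer","285713:13"\n"Champhol","461034:2"\n"Station Palais","972811:0"\n'

conn = sqlite3.connect(':memory:')  # use a filename to persist between runs
conn.execute('CREATE TABLE anchors (title TEXT PRIMARY KEY, value TEXT)')
with conn:  # commits the inserts
    conn.executemany('INSERT INTO anchors VALUES (?, ?)',
                     csv.reader(io.StringIO(sample)))

# The PRIMARY KEY index makes this lookup fast even for large tables.
row = conn.execute('SELECT value FROM anchors WHERE title = ?',
                   ('Station Palais',)).fetchone()
print(row[0])  # 972811:0
```

Loading is a one-time cost; every search after that is an indexed lookup rather than a linear scan of the file.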

Reading CSV file and generating Dictionaries

I have a CSV file looks like
Hit39, Hit24, Hit9
Hit8, Hit39, Hit21
Hit46, Hit47, Hit20
Hit24, Hit 53, Hit46
I want to read the file and create a dictionary that numbers values on a first-come, first-served basis,
like Hit39: 1, Hit24: 2, and so on.
But notice that Hit39 appears again in row 2, column 2. When the reader encounters a value it has already seen, it should not add it to the dictionary again; it should move on and only number new values.
In other words, once a value has been visited, later appearances of it should be ignored.
Using Python. Best guess until the OP is clarified: treat the file as though it were one huge list and assign an incrementing counter to unique occurrences of each value.
import csv
from itertools import count

mydict = {}
counter = count(1)
with open('infile.csv') as fin:
    for row in csv.reader(fin, skipinitialspace=True):
        for col in row:
            # Only assign a number the first time a value is seen.
            # (mydict.get(col, next(counter)) would advance the counter
            # even for values already numbered, leaving gaps.)
            if col not in mydict:
                mydict[col] = next(counter)
Since Python is a popular language that has dictionaries, you must be using Python. At least I assume.
import csv
reader = csv.reader(open("filename.csv"))
d = dict((line[0], 1 + lineno) for lineno, line in enumerate(reader))
print d
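To check the first-come, first-served numbering against the sample data from the question (using io.StringIO so the sketch needs no file on disk):

```python
import csv
import io
from itertools import count

# First two rows of the sample data; Hit39 repeats in row 2.
sample = "Hit39, Hit24, Hit9\nHit8, Hit39, Hit21\n"

mydict = {}
counter = count(1)
for row in csv.reader(io.StringIO(sample), skipinitialspace=True):
    for col in row:
        if col not in mydict:        # skip values already numbered
            mydict[col] = next(counter)

print(mydict)  # {'Hit39': 1, 'Hit24': 2, 'Hit9': 3, 'Hit8': 4, 'Hit21': 5}
```

The repeated Hit39 is ignored and the numbering stays consecutive, matching the "Hit39: 1, Hit24: 2, ..." expectation.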