I was wondering what the simplest way is to convert a string representation of a list like the following to a list:
x = '[ "A","B","C" , " D"]'
Even in cases where the user puts spaces in between the commas, and spaces inside of the quotes, I need to handle that as well and convert it to:
x = ["A", "B", "C", "D"]
I know I can strip spaces with strip() and split() and check for non-letter characters. But the code was getting very kludgy. Is there a quick function that I'm not aware of?
>>> import ast
>>> x = '[ "A","B","C" , " D"]'
>>> x = ast.literal_eval(x)
>>> x
['A', 'B', 'C', ' D']
>>> x = [n.strip() for n in x]
>>> x
['A', 'B', 'C', 'D']
ast.literal_eval:
With ast.literal_eval you can safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, booleans, and None.
The json module is a better solution whenever there is a stringified list of dictionaries. The json.loads(your_data) function can be used to convert it to a list.
>>> import json
>>> x = '[ "A","B","C" , " D"]'
>>> json.loads(x)
['A', 'B', 'C', ' D']
Similarly
>>> x = '[ "A","B","C" , {"D":"E"}]'
>>> json.loads(x)
['A', 'B', 'C', {'D': 'E'}]
The eval is dangerous - you shouldn't execute user input.
If you have 2.6 or newer, use ast instead of eval:
>>> import ast
>>> ast.literal_eval('["A","B" ,"C" ," D"]')
["A", "B", "C", " D"]
Once you have that, strip the strings.
If you're on an older version of Python, you can get very close to what you want with a simple regular expression:
>>> x='[ "A", " B", "C","D "]'
>>> re.findall(r'"\s*([^"]*?)\s*"', x)
['A', 'B', 'C', 'D']
This isn't as good as the ast solution, for example it doesn't correctly handle escaped quotes in strings. But it's simple, doesn't involve a dangerous eval, and might be good enough for your purpose if you're on an older Python without ast.
There is a quick solution:
x = eval('[ "A","B","C" , " D"]')
Unwanted whitespaces in the list elements may be removed in this way:
x = [x.strip() for x in eval('[ "A","B","C" , " D"]')]
Inspired from some of the answers above that work with base Python packages I compared the performance of a few (using Python 3.7.3):
Method 1: ast
import ast
list(map(str.strip, ast.literal_eval(u'[ "A","B","C" , " D"]')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, ast.literal_eval(u'[ \"A\",\"B\",\"C\" , \" D\"]')))", setup='import ast', number=100000)
# 1.292875313000195
Method 2: json
import json
list(map(str.strip, json.loads(u'[ "A","B","C" , " D"]')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, json.loads(u'[ \"A\",\"B\",\"C\" , \" D\"]')))", setup='import json', number=100000)
# 0.27833264000014424
Method 3: no import
list(map(str.strip, u'[ "A","B","C" , " D"]'.strip('][').replace('"', '').split(',')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, u'[ \"A\",\"B\",\"C\" , \" D\"]'.strip('][').replace('\"', '').split(',')))", number=100000)
# 0.12935059100027502
I was disappointed to see what I considered the method with the worst readability was the method with the best performance... there are trade-offs to consider when going with the most readable option... for the type of workloads I use Python for I usually value readability over a slightly more performant option, but as usual it depends.
import ast
l = ast.literal_eval('[ "A","B","C" , " D"]')
l = [i.strip() for i in l]
If it's only a one dimensional list, this can be done without importing anything:
>>> x = u'[ "A","B","C" , " D"]'
>>> ls = x.strip('[]').replace('"', '').replace(' ', '').split(',')
>>> ls
['A', 'B', 'C', 'D']
This u can do,
**
x = '[ "A","B","C" , " D"]'
print(list(eval(x)))
**
best one is the accepted answer
Though this is not a safe way, the best answer is the accepted one.
wasn't aware of the eval danger when answer was posted.
There isn't any need to import anything or to evaluate. You can do this in one line for most basic use cases, including the one given in the original question.
One liner
l_x = [i.strip() for i in x[1:-1].replace('"',"").split(',')]
Explanation
x = '[ "A","B","C" , " D"]'
# String indexing to eliminate the brackets.
# Replace, as split will otherwise retain the quotes in the returned list
# Split to convert to a list
l_x = x[1:-1].replace('"',"").split(',')
Outputs:
for i in range(0, len(l_x)):
print(l_x[i])
# vvvv output vvvvv
'''
A
B
C
D
'''
print(type(l_x)) # out: class 'list'
print(len(l_x)) # out: 4
You can parse and clean up this list as needed using list comprehension.
l_x = [i.strip() for i in l_x] # list comprehension to clean up
for i in range(0, len(l_x)):
print(l_x[i])
# vvvvv output vvvvv
'''
A
B
C
D
'''
Nested lists
If you have nested lists, it does get a bit more annoying. Without using regex (which would simplify the replace), and assuming you want to return a flattened list (and the zen of python says flat is better than nested):
x = '[ "A","B","C" , " D", ["E","F","G"]]'
l_x = x[1:-1].split(',')
l_x = [i
.replace(']', '')
.replace('[', '')
.replace('"', '')
.strip() for i in l_x
]
# returns ['A', 'B', 'C', 'D', 'E', 'F', 'G']
If you need to retain the nested list it gets a bit uglier, but it can still be done just with regular expressions and list comprehension:
import re
x = '[ "A","B","C" , " D", "["E","F","G"]","Z", "Y", "["H","I","J"]", "K", "L"]'
# Clean it up so the regular expression is simpler
x = x.replace('"', '').replace(' ', '')
# Look ahead for the bracketed text that signifies nested list
l_x = re.split(r',(?=\[[A-Za-z0-9\',]+\])|(?<=\]),', x[1:-1])
print(l_x)
# Flatten and split the non nested list items
l_x0 = [item for items in l_x for item in items.split(',') if not '[' in items]
# Convert the nested lists to lists
l_x1 = [
i[1:-1].split(',') for i in l_x if '[' in i
]
# Add the two lists
l_x = l_x0 + l_x1
This last solution will work on any list stored as a string, nested or not.
Assuming that all your inputs are lists and that the double quotes in the input actually don't matter, this can be done with a simple regexp replace. It is a bit perl-y, but it works like a charm. Note also that the output is now a list of Unicode strings, you didn't specify that you needed that, but it seems to make sense given Unicode input.
import re
x = u'[ "A","B","C" , " D"]'
junkers = re.compile('[[" \]]')
result = junkers.sub('', x).split(',')
print result
---> [u'A', u'B', u'C', u'D']
The junkers variable contains a compiled regexp (for speed) of all characters we don't want, using ] as a character required some backslash trickery.
The re.sub replaces all these characters with nothing, and we split the resulting string at the commas.
Note that this also removes spaces from inside entries u'["oh no"]' ---> [u'ohno']. If this is not what you wanted, the regexp needs to be souped up a bit.
If you know that your lists only contain quoted strings, this pyparsing example will give you your list of stripped strings (even preserving the original Unicode-ness).
>>> from pyparsing import *
>>> x =u'[ "A","B","C" , " D"]'
>>> LBR,RBR = map(Suppress,"[]")
>>> qs = quotedString.setParseAction(removeQuotes, lambda t: t[0].strip())
>>> qsList = LBR + delimitedList(qs) + RBR
>>> print qsList.parseString(x).asList()
[u'A', u'B', u'C', u'D']
If your lists can have more datatypes, or even contain lists within lists, then you will need a more complete grammar - like this one in the pyparsing examples directory, which will handle tuples, lists, ints, floats, and quoted strings.
You may run into such problem while dealing with scraped data stored as Pandas DataFrame.
This solution works like charm if the list of values is present as text.
def textToList(hashtags):
return hashtags.strip('[]').replace('\'', '').replace(' ', '').split(',')
hashtags = "[ 'A','B','C' , ' D']"
hashtags = textToList(hashtags)
Output: ['A', 'B', 'C', 'D']
No external library required.
This usually happens when you load list stored as string to CSV
If you have your list stored in CSV in form like OP asked:
x = '[ "A","B","C" , " D"]'
Here is how you can load it back to list:
import csv
with open('YourCSVFile.csv') as csv_file:
reader = csv.reader(csv_file, delimiter=',')
rows = list(reader)
listItems = rows[0]
listItems is now list
To further complete Ryan's answer using JSON, one very convenient function to convert Unicode is in this answer.
Example with double or single quotes:
>print byteify(json.loads(u'[ "A","B","C" , " D"]')
>print byteify(json.loads(u"[ 'A','B','C' , ' D']".replace('\'','"')))
['A', 'B', 'C', ' D']
['A', 'B', 'C', ' D']
I would like to provide a more intuitive patterning solution with regex.
The below function takes as input a stringified list containing arbitrary strings.
Stepwise explanation:
You remove all whitespacing,bracketing and value_separators (provided they are not part of the values you want to extract, else make the regex more complex). Then you split the cleaned string on single or double quotes and take the non-empty values (or odd indexed values, whatever the preference).
def parse_strlist(sl):
import re
clean = re.sub("[\[\],\s]","",sl)
splitted = re.split("[\'\"]",clean)
values_only = [s for s in splitted if s != '']
return values_only
testsample: "['21',"foo" '6', '0', " A"]"
You can save yourself the .strip() function by just slicing off the first and last characters from the string representation of the list (see the third line below):
>>> mylist=[1,2,3,4,5,'baloney','alfalfa']
>>> strlist=str(mylist)
['1', ' 2', ' 3', ' 4', ' 5', " 'baloney'", " 'alfalfa'"]
>>> mylistfromstring=(strlist[1:-1].split(', '))
>>> mylistfromstring[3]
'4'
>>> for entry in mylistfromstring:
... print(entry)
... type(entry)
...
1
<class 'str'>
2
<class 'str'>
3
<class 'str'>
4
<class 'str'>
5
<class 'str'>
'baloney'
<class 'str'>
'alfalfa'
<class 'str'>
And with pure Python - not importing any libraries:
[x for x in x.split('[')[1].split(']')[0].split('"')[1:-1] if x not in[',',' , ',', ']]
So, following all the answers I decided to time the most common methods:
from time import time
import re
import json
my_str = str(list(range(19)))
print(my_str)
reps = 100000
start = time()
for i in range(0, reps):
re.findall("\w+", my_str)
print("Regex method:\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
json.loads(my_str)
print("JSON method:\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
ast.literal_eval(my_str)
print("AST method:\t\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
[n.strip() for n in my_str]
print("strip method:\t", (time() - start) / reps)
regex method: 6.391477584838867e-07
json method: 2.535374164581299e-06
ast method: 2.4425282478332518e-05
strip method: 4.983267784118653e-06
So in the end regex wins!
This solution is simpler than some I read in the previous answers, but it requires to match all features of the list.
x = '[ "A","B","C" , " D"]'
[i.strip() for i in x.split('"') if len(i.strip().strip(',').strip(']').strip('['))>0]
Output:
['A', 'B', 'C', 'D']
I have a csv file which has many columns. Now my requirement is to find all possible value that are present for that specific column.
Is there any built in function in python that helps me to get these values.
You can us pandas.
Example file many_cols.csv:
col1,col2,col3
1,10,100
1,20,100
2,10,100
3,30,100
Find unique values per column:
>>> import pandas as pd
>>> df = pd.read_csv('many_cols.csv')
>>> df.col1.drop_duplicates().tolist()
[1, 2, 3]
>>> df['col2'].drop_duplicates().tolist()
[10, 20, 30]
>>> df['col3'].drop_duplicates().tolist()
[100]
For all columns:
import pandas as pd
df = pd.read_csv('many_cols.csv')
for col in df.columns:
print(col, df[col].drop_duplicates().tolist())
Output:
col1 [1, 2, 3]
col2 [10, 20, 30]
col3 [100]
I would use a set() for this.
Lets say the csv file is this and we want only unique values from second column.
foo,1,bar
baz,2,foo
red,3,blue
git,3,foo
Here is the code that would accomplish this. I am simply printing out the unique values to test that it worked.
import csv
def parse_csv_file(rawCSVFile):
fileLineList = []
with open(rawCSVFile, newline='') as csvfile:
reader = csv.reader(csvfile)
for row in reader:
fileLineList.append(row)
return fileLineList
def main():
uniqueColumnValues = set()
fileLineList = parse_csv_file('sample.csv')
for row in fileLineList:
uniqueColumnValues.add(row[1]) # Selecting 2nd column here.
print(uniqueColumnValues)
if __name__ == '__main__':
main()
Overly "clever" approach to figuring out unique values for all the rows at once (assumes all columns are the same size, though it ignores empty lines seamlessly):
# Assumes somefile was opened properly earlier
csvin = filter(None, csv.reader(somefile))
for i, vals in enumerate(map(sorted, map(set, zip(*csvin)))):
print("Unique values for column", i)
print(vals)
It uses zip(*csvin) to do a table rotation (converting the normal one row at a time output to one column at a time), then uniquifies each column with set, and (for nice output) sorts it.
I've been trying for weeks to plot 3 sets of (x, y) data on the same plot from a .CSV file, and I'm getting nowhere. My data was originally an Excel file which I have converted to a .CSV file and have used pandas to read it into IPython as per the following code:
from pandas import DataFrame, read_csv
import pandas as pd
# define data location
df = read_csv(Location)
df[['LimMag1.3', 'ExpTime1.3', 'LimMag2.0', 'ExpTime2.0', 'LimMag2.5','ExpTime2.5']][:7]
My data is in the following format:
Type mag1 time1 mag2 time2 mag3 time3
M0 8.87 41.11 8.41 41.11 8.16 65.78;
...
M6 13.95 4392.03 14.41 10395.13 14.66 25988.32
I'm trying to plot time1 vs mag1, time2 vs mag2 and time3 vs mag3, all on the same plot, but instead I get plots of time.. vs Type, eg. for the code:
df['ExpTime1.3'].plot()
I get 'ExpTime1.3' (y-axis) plotted against M0 to M6 (x-axis), when what I want is 'ExpTime1.3' vs 'LimMag1.3', with x-labels M0 - M6.
How do I get 'ExpTime..' vs 'LimMag..' plots, with all 3 sets of data on the same plot?
How do I get the M0 - M6 labels on the x-axis for the 'LimMag..' values (also on the x-axis)?
Since trying askewchan's solutions, which did not return any plots for reasons unknown, I've found that I can get a plot of ExpTimevs LimMagusing df['ExpTime1.3'].plot(),if I change the dataframe index (df.index) to the values of the x axis (LimMag1.3). However, this appears to mean that I have to convert each desired x-axis to the dataframe index by manually inputing all the values of the desired x-axis to make it the data index. I have an awful lot of data, and this method is just too slow, and I can only plot one set of data at a time, when I need to plot all 3 series for each dataset on the one graph. Is there a way around this problem? Or can someone offer a reason, and a solution, as to why I I got no plots whatsoever with the solutions offered by askewchan?\
In response to nordev, I have tried the first version again, bu no plots are produced, not even an empty figure. Each time I put in one of the ax.plotcommands, I do get an output of the type:
[<matplotlib.lines.Line2D at 0xb5187b8>], but when I enter the command plt.show()nothing happens.
When I enter plt.show()after the loop in askewchan's second solution, I get an error back saying AttributeError: 'function' object has no attribute 'show'
I have done a bit more fiddling with my original code and can now get a plot of ExpTime1.3vs LimMag1.3 with the code df['ExpTime1.3'][:7].plot(),by making the index the same as the x axis (LimMag1.3), but I can't get the other two sets of data on the same plot. I would appreciate any further suggestions you may have. I'm using ipython 0.11.0 via Anaconda 1.5.0 (64bit) and spyder on Windows 7 (64bit), python version is 2.7.4.
If I have understood you correctly, both from this question as well as your previous one on the same subject, the following should be basic solutions you could customize to your needs.
Several subplots:
Note that this solution will output as many subplots as there are Spectral classes (M0, M1, ...) vertically on the same figure. If you wish to save the plot of each Spectral class in a separate figure, the code needs some modifications.
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
import matplotlib.pyplot as plt
# Here you put your code to read the CSV-file into a DataFrame df
plt.figure(figsize=(7,5)) # Set the size of your figure, customize for more subplots
for i in range(len(df)):
xs = np.array(df[df.columns[0::2]])[i] # Use values from odd numbered columns as x-values
ys = np.array(df[df.columns[1::2]])[i] # Use values from even numbered columns as y-values
plt.subplot(len(df), 1, i+1)
plt.plot(xs, ys, marker='o') # Plot circle markers with a line connecting the points
for j in range(len(xs)):
plt.annotate(df.columns[0::2][j][-3:] + '"', # Annotate every plotted point with last three characters of the column-label
xy = (xs[j],ys[j]),
xytext = (0, 5),
textcoords = 'offset points',
va = 'bottom',
ha = 'center',
clip_on = True)
plt.title('Spectral class ' + df.index[i])
plt.xlabel('Limiting Magnitude')
plt.ylabel('Exposure Time')
plt.grid(alpha=0.4)
plt.tight_layout()
plt.show()
All in same Axes, grouped by rows (M0, M1, ...)
Here is another solution to get all the different Spectral classes plotted in the same Axes with a legend identifying the different classes. The plt.yscale('log') is optional, but seeing as how the values span such a great range, it is recommended.
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
import matplotlib.pyplot as plt
# Here you put your code to read the CSV-file into a DataFrame df
for i in range(len(df)):
xs = np.array(df[df.columns[0::2]])[i] # Use values from odd numbered columns as x-values
ys = np.array(df[df.columns[1::2]])[i] # Use values from even numbered columns as y-values
plt.plot(xs, ys, marker='o', label=df.index[i])
for j in range(len(xs)):
plt.annotate(df.columns[0::2][j][-3:] + '"', # Annotate every plotted point with last three characters of the column-label
xy = (xs[j],ys[j]),
xytext = (0, 6),
textcoords = 'offset points',
va = 'bottom',
ha = 'center',
rotation = 90,
clip_on = True)
plt.title('Spectral classes')
plt.xlabel('Limiting Magnitude')
plt.ylabel('Exposure Time')
plt.grid(alpha=0.4)
plt.yscale('log')
plt.legend(loc='best', title='Spectral classes')
plt.show()
All in same Axes, grouped by columns (1.3", 2.0", 2.5")
A third solution is as shown below, where the data are grouped by the series (columns 1.3", 2.0", 2.5") rather than by the Spectral class (M0, M1, ...). This example is very similar to
#askewchan's solution. One difference is that the y-axis here is a logarithmic axis, making the lines pretty much parallel.
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
import matplotlib.pyplot as plt
# Here you put your code to read the CSV-file into a DataFrame df
xs = np.array(df[df.columns[0::2]]) # Use values from odd numbered columns as x-values
ys = np.array(df[df.columns[1::2]]) # Use values from even numbered columns as y-values
for i in range(df.shape[1]/2):
plt.plot(xs[:,i], ys[:,i], marker='o', label=df.columns[0::2][i][-3:]+'"')
for j in range(len(xs[:,i])):
plt.annotate(df.index[j], # Annotate every plotted point with its Spectral class
xy = (xs[:,i][j],ys[:,i][j]),
xytext = (0, -6),
textcoords = 'offset points',
va = 'top',
ha = 'center',
clip_on = True)
plt.title('Spectral classes')
plt.xlabel('Limiting Magnitude')
plt.ylabel('Exposure Time')
plt.grid(alpha=0.4)
plt.yscale('log')
plt.legend(loc='best', title='Series')
plt.show()
You can call pyplot.plot(time, mag) three different times in the same figure. It would be wise to give a label to them. Something like this:
import matplotlib.pyplot as plt
...
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df['LimMag1.3'], df['ExpTime1.3'], label="1.3")
ax.plot(df['LimMag2.0'], df['ExpTime2.0'], label="2.0")
ax.plot(df['LimMag2.5'], df['ExpTime2.5'], label="2.5")
plt.show()
If you want to loop it, this would work:
fig = plt.figure()
ax = fig.add_subplot(111)
for x,y in [['LimMag1.3', 'ExpTime1.3'],['LimMag2.0', 'ExpTime2.0'], ['LimMag2.5','ExpTime2.5']]:
ax.plot(df[x], df[y], label=y)
plt.show()