pandas returning empty DataFrames for CSV

I have some large CSV and XLSX files for which I need to set up pandas DataFrames. I have code which locates these files within the directory (when printed, these show correct pathnames). These paths are then passed to a helper function which is meant to set up the required DataFrames for the files; the data will then be passed to other functions for some manipulation. Once this is completed, I intend to have the data written to a file (by loading a template, writing the data to it, and saving that file).
I currently have code like:
import pandas
from io import StringIO
# some set-up functions (which work; verified using print statements)
def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
When I try print(df), I get:
Empty DataFrame
Columns: [a.csv]
Index: []
and print(StringIO(filename)) gives me:
<_io.StringIO object at 0x004D1990>
However, when I leave out the StringIO() around filename in the function, I get this error:
OSError: File b'a.csv' does not exist
Everywhere that I've been able to find information on this either just says to import pandas and start using it, or talks about using read_csv() rather than from_csv() (from this question, which wasn't very helpful here), and even the current pandas docs basically say that it should be as easy as passing the filename to pandas.read_csv().
1) I've checked that I have full permissions and that the file is valid and exists. Why am I getting the OSError?
2) When I use StringIO(), why am I still getting an empty DataFrame here? How can I fix this?
Thanks in advance.

I have solved this.
StringIO was the root cause of this problem. Because I'm on Windows, my path check with os.path.is_file() was returning False, and I got the error:
OSError: File b'a.csv' does not exist
It wasn't until I stumbled upon a page from the Python 2.5 docs that I discovered that the call should actually be os.path.isfile(); on Windows, os.path uses ntpath behind the scenes, which better handles the difference in pathnames between systems (Windows uses '\', Unix uses '/').
Because something weird was going on in my paths, pandas was unable to properly load the CSV files into DataFrames.
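For what it's worth, the empty DataFrame from the StringIO() attempt also makes sense in hindsight: wrapping the filename in StringIO() hands pandas the literal text "a.csv" as if it were the file's contents rather than a path to open. A minimal sketch reproducing that output:
from io import StringIO
import pandas
# pandas treats the string "a.csv" as the CSV data itself: a one-cell header row
# and no data rows, which prints as an empty DataFrame with column [a.csv].
print(pandas.read_csv(StringIO("a.csv")))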
By simply changing my code from this:
import pandas
from io import StringIO
# some set-up functions (which work; verified using print statements)
def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
to this:
import os
import pandas
# some set-up functions (which have been updated)
def createDataFrame(filename):
    basepath = config.complete_datasets_dir  # project-specific config module
    fullpath = os.path.join(basepath, filename)
    if filename.endswith('.csv'):
        df = pandas.read_csv(fullpath, skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
and appropriately updating the function which calls that function:
def somefunc():
    dfs = []
    data_lists = getInputFiles()  # checks data directory for files containing info
    for item in data_lists:
        tdata = createDataFrame(item)
        dfs.append(tdata)
    print(dfs)
I was able to get the output I was looking for:
[ 1 2 3 4 5 6 7 8 9 10
0 11 12 13 14 15 16 17 18 19 20
1 21 22 23 24 25 26 27 28 29 30
2 31 32 33 34 35 36 37 38 39 40, 1 2 3 4 5 6 7 8 9 10
0 11 12 13 14 15 16 17 18 19 20
1 21 22 23 24 25 26 27 28 29 30]
which is a list of two DataFrames, the first of which came from a CSV containing only the numbers 1-40 (on 4 rows total, no headers); the second file contains only the numbers 1-30 (formatted the same way).
I hope this helps someone in the future.

Related

spark input_line like input_file_name when reading line separated json

I have a newline-separated JSON file. Each line consists of a JSON array which contains different types of documents. The standard behavior of Spark, when I read it naively with
df = spark.read.json(file.path)
is to create a new row for each document in each array, while the different document types each become a column and, for each row, only one of them is not null.
I need to recover the line number of each document, as the documents on the same line have a relation which is otherwise not recoverable.
I imagined there would be something similar to input_file_name(), but there is none. Is there a way to achieve this?
An example input is
[{"firstAttribute":1, "secondAttribute":2},{"firstAttribute":10, "secondAttribute":20}]
[{"firstAttribute":3, "secondAttribute":4},{"thirdAttribute":5, "secondAttribute":6}]
The resulting dataframe now looks like this:
+--------------+---------------+--------------+
|firstAttribute|secondAttribute|thirdAttribute|
+--------------+---------------+--------------+
|             1|              2|          null|
|            10|             20|          null|
|             3|              4|          null|
|          null|              6|             5|
+--------------+---------------+--------------+
Now I would like to see the line number of the source file for each entry, like this:
+-----------+--------------+---------------+--------------+
|line number|firstAttribute|secondAttribute|thirdAttribute|
+-----------+--------------+---------------+--------------+
|          1|             1|              2|          null|
|          1|            10|             20|          null|
|          2|             3|              4|          null|
|          2|          null|              6|             5|
+-----------+--------------+---------------+--------------+
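One possible direction (a rough, untested sketch in PySpark; the schema below assumes every attribute can be read as a string) is to read the file as plain text so each physical line keeps its position, then parse and explode the JSON arrays manually:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Keep each physical line together with its position in the file.
lines = spark.sparkContext.textFile("file.path").zipWithIndex()
lines_df = lines.toDF(["value", "line_index"])

# Parse every line as an array of documents, then explode to one row per document.
docs = lines_df.withColumn(
    "doc",
    explode(from_json(col("value"), ArrayType(MapType(StringType(), StringType())))),
)
result = docs.select(
    (col("line_index") + 1).alias("line number"),
    col("doc")["firstAttribute"].alias("firstAttribute"),
    col("doc")["secondAttribute"].alias("secondAttribute"),
    col("doc")["thirdAttribute"].alias("thirdAttribute"),
)
result.show()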

Spark - Strange characters when reading CSV file

I hope someone could help me please. My problem is the following:
To read a CSV file in Spark I'm using the code
val df=spark.read.option("header","true").option("inferSchema","true").csv("/home/user/Documents/filename.csv")
assuming that my file is called filename.csv and the path is /home/user/Documents/
To show the first 10 results I use
df.show(10)
but instead I get the following result, which contains the character � and does not show the 10 rows as desired:
scala> df.show(10)
+--------+---------+---------+-----------------+
| c1| c2| c3| c4|
+--------+---------+---------+-----------------+
|��1.0|5450|3007|20160101|
+--------+---------+---------+-----------------+
The CSV file looks something like this
c1 c2 c3 c4
1 5450 3007 20160101
2 2156 1414 20160107
1 78229 3656 20160309
1 34963 4484 20160104
1 7897 3350 20160105
11 13247 3242 20160303
2 4957 3350 20160124
1 73083 4211 20160207
The file that I'm trying to read is big. When I try smaller files I don't get the strange character and I can see the first 10 results without problem.
Any help is appreciated.
Sometimes the problem is not caused by Spark's settings. Try re-saving ("Save As") your CSV file as "CSV UTF-8 (comma delimited)", then rerun your code, and the strange characters will be gone. I had a similar problem reading a CSV file containing German words; after doing the above, it was all good.
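If re-saving the file is not an option, another possibility (a sketch in PySpark rather than the Scala used above) is to tell Spark's CSV reader which encoding to decode the file with; "UTF-16" below is only a placeholder for whatever encoding the file actually uses:
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("encoding", "UTF-16")  # placeholder: substitute the file's real encoding
      .csv("/home/user/Documents/filename.csv"))
df.show(10)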

Creating unique node and relationship NEO4J over huge dataset

My question is very similar to this one:
How to create unique nodes and relationships by csv file imported in neo4j?
I have a textfile with around 2.5 million lines that has two columns, each one being node ids:
1234 345
1234 568
345 984
... ...
Each line represents a relationship (so 2.5 million relationships): first_column nodeid-> FOLLOWS -> second_column nodeid. There are around 80,000 unique nodes in this file.
Based on the link above, I did:
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///home/user_name/Desktop/bigfile.csv' AS line FIELDTERMINATOR ' '
MERGE (n:Userid { id: toInt(line[0]) })
WITH line, n
MERGE (m:Userid { id: toInt(line[1]) })
WITH m,n
MERGE (n)-[:FOLLOWS]->(m)
I am assuming this code:
creates node n or m if it doesn't exist (and finds it if it does exist), and creates a relationship from n to m;
and if n or m already exists and has many other edges (relationships) to and from other nodes, it just adds another edge from n to m (rather than creating a brand-new node when one already exists).
My main question is how to make this process faster.
This is being done on Ubuntu, and I changed the memory setting in the conf/neo4j-wrapper.conf file from 512 to 2048 MB (the maximum I can allocate on my virtual machine).
Should I try the import tool instead?
Based on the example at neo4j.com/developer/guide-import-csv/ under "Super Fast Batch Importer For Huge Datasets", I would run:
./bin/neo4j-import --into mydatabase.db --id-type INTEGER \
--nodes allnodes.csv \
--delimiter " " \
--relationships:FOLLOWS bigfile.csv
And to do this, I need to reformat files so that:
allnodes.csv shows
userID:ID(Userid)
1234
5678
...
And bigfile.csv shows
:START_ID(Userid) :END_ID(Userid)
1234 345
1234 568
345 984
*Two columns delimited by space*
And when I run this import, I get this error:
Input error: Expected '--nodes' to have at least 1 valid item, but had 0 []
Caused by:Expected '--nodes' to have at least 1 valid item, but had 0 []
java.lang.IllegalArgumentException: Expected '--nodes' to have at least 1 valid item, but had 0 []
How do I fix this error? And for the CSV files, do I put them in the same folder where I run this command (the neo4j folder)?
Your command line probably has the wrong paths for your two CSV files.
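For example (an illustrative guess only, reusing the Desktop paths from the LOAD CSV attempt above; adjust them to wherever the files actually live), try passing absolute paths to the importer:
./bin/neo4j-import --into mydatabase.db --id-type INTEGER \
    --nodes /home/user_name/Desktop/allnodes.csv \
    --delimiter " " \
    --relationships:FOLLOWS /home/user_name/Desktop/bigfile.csv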

How to merge two ipython notebooks correctly without getting json error?

I have tried:
cat file1.ipynb file2.ipynb > filecomplete.ipynb
since the notebooks are simply json files, but this gives me the error
Unreadable Notebook: Notebook does not appear to be JSON: '{\n "metadata": {'
I think these must be valid json files because file1 and file2 each load individually into nbviewer, and so I am not entirely sure what I am doing wrong.
This Python script concatenates all the notebooks named with a given prefix and present at the first level of a given folder. The resulting notebook is saved in the same folder under the name "compil_" + prefix + ".ipynb".
import json
import os

folder = "slides"
prefix = "quiz"
paths = [os.path.join(folder, name) for name in os.listdir(folder)
         if name.startswith(prefix) and name.endswith(".ipynb")]
result = json.loads(open(paths.pop(0), "r").read())
for path in paths:
    result["worksheets"][0]["cells"].extend(
        json.loads(open(path, "r").read())["worksheets"][0]["cells"])
open(os.path.join(folder, "compil_%s.ipynb" % prefix), "w").write(
    json.dumps(result, indent=1))
Warning: the metadata are those of the first notebook, and the cells those of the first worksheet only (which seems to contain all the cells, in my notebook at least).
Concatenating two objects that each have some property does not always yield an object with the same property. Here is an increasing sequence of numbers: 4 8 15 16 23 42; here is another one: 1 2 3 4 5 6 7. The concatenation of the two, 4 8 15 16 23 42 1 2 3 4 5 6 7, is not increasing. The same goes for JSON: concatenating two valid JSON files does not give valid JSON.
You need to load each JSON file with the json library and do the merge you want yourself. I suppose you "just" want to concatenate the cells, but maybe you want to concatenate worksheets, or maybe you want to merge the metadata.

Convert a dta file to csv without Stata software

Is there a way to convert a dta file to a csv?
I do not have a version of Stata installed on my computer, so I cannot do something like:
File --> "Save as csv"
The frankly-incredible data-analysis library for Python called Pandas has a function to read Stata files.
After installing Pandas you can just do:
>>> import pandas as pd
>>> data = pd.io.stata.read_stata('my_stata_file.dta')
>>> data.to_csv('my_stata_file.csv')
Amazing!
You could try doing it through R:
For Stata <= 15 you can use the haven package to read the dataset and then you simply write it to external CSV file:
library(haven)
yourData = read_dta("path/to/file")
write.csv(yourData, file = "yourStataFile.csv")
Alternatively, visit the link pointed to by huntaub in a comment below.
For Stata <= 12 datasets, the foreign package can also be used:
library(foreign)
yourData <- read.dta("yourStataFile.dta")
You can do it in StatTransfer, R or perl (as mentioned by others), but StatTransfer costs $$$ and R/Perl have a learning curve.
There is a free, menu-driven stats program from AM Statistical Software that can open and convert Stata .dta from all versions of Stata, see:
http://am.air.org/
I have not tried, but if you know Perl you can use the Parse-Stata-DtaReader module to convert the file for you.
The module has a command-line tool dta2csv, which can "convert Stata 8 and Stata 10 .dta files to csv"
Another way of converting between pretty much any data format using R is with the rio package.
Install R from CRAN and open R
Install the rio package using install.packages("rio")
Load the rio library, then use the convert() function:
library("rio")
convert("my_file.dta", "my_file.csv")
This method allows you to convert between many formats (e.g., Stata, SPSS, SAS, CSV, etc.). It uses the file extension to infer format and load using the appropriate importing package. More info can be found on the R-project rio page.
The R method will work reliably, and it requires little knowledge of R. Note that the conversion using the foreign package will preserve data, but may introduce differences. For example, when converting a table without a primary key, the primary key and associated columns will be inserted during the conversion.
From http://www.r-bloggers.com/using-r-for-stata-to-csv-conversion/ I recommend:
library(foreign)
write.table(read.dta(file.choose()), file=file.choose(), quote = FALSE, sep = ",")
In Python, one can use statsmodels.iolib.foreign.genfromdta to read Stata datasets. In addition, there is also a wrapper of the aforementioned function which can be used to read a Stata file directly from the web: statsmodels.datasets.webuse.
Nevertheless, both of the above rely on pandas.io.stata.StataReader.data, which is now a legacy function and has been deprecated. As such, the new pandas.read_stata function should now always be used instead.
According to the source file of stata.py, as of version 0.23.0, the following are supported:
Stata data file versions:
104
105
108
111
113
114
115
117
118
Valid encodings:
ascii
us-ascii
latin-1
latin_1
iso-8859-1
iso8859-1
8859
cp819
latin
latin1
L1
As others have noted, the DataFrame.to_csv method can then be used to save the data to disk. A related function, numpy.savetxt, can also save the data as a text file.
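A minimal sketch of that route (the file names are placeholders, and the numpy.savetxt call assumes a plain, header-less text dump is acceptable):
import numpy
import pandas
# read_stata replaces the deprecated StataReader.data route mentioned above
df = pandas.read_stata("my_stata_file.dta")
df.to_csv("my_stata_file.csv", index=False)
# numpy.savetxt writes the raw values only, with no header row
numpy.savetxt("my_stata_file.txt", df.values, fmt="%s", delimiter=",")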
EDIT:
The following details come from help dtaversion in Stata 15.1:
Stata version .dta file format
----------------------------------------
1 102
2, 3 103
4 104
5 105
6 108
7 110 and 111
8, 9 112 and 113
10, 11 114
12 115
13 117
14 and 15 118 (# of variables <= 32,767)
15 119 (# of variables > 32,767, Stata/MP only)
----------------------------------------
file formats 103, 106, 107, 109, and 116
were never used in any official release.
StatTransfer is a program that moves data easily between Stata, Excel (or csv), SAS, etc. It is very user friendly (requires no programming skills). See www.stattransfer.com
If you use the program just note that you will have to choose "ASCII/Text - Delimited" to work with .csv files rather than .xls
Some have mentioned SPSS and StatTransfer, but they are not free. R and Python (also mentioned above) may be your choice. Personally, I would recommend Python; its syntax is much more intuitive than R's. With pandas you can read and export most of the commonly used data formats in just a few lines:
import pandas as pd
df = pd.read_stata('YourDataName.dta')
df.to_csv('YourDataName.csv')
SPSS can also read .dta files and export them to .csv, but it costs money. PSPP, an open-source (if rough) version of SPSS, might also be able to read and export .dta files.
PYTHON - CONVERT STATA FILES IN DIRECTORY TO CSV
import glob
import os
import pandas

path = r"{Path to Folder}"
for file in glob.glob(os.path.join(path, "*.dta")):  # collects all the stata files
    # get the file path/name without the ".dta" extension
    file_name, file_extension = os.path.splitext(file)
    # read your data
    df = pandas.read_stata(file, convert_categoricals=False, convert_missing=True)
    # save the data and never think about stata again :)
    df.to_csv(file_name + '.csv')
For those who have Stata (even though the asker does not) you can use this:
outsheet produces a tab-delimited file so you need to specify the comma option like below
outsheet [varlist] using file.csv , comma
Also, if you want to remove labels (which are included by default):
outsheet [varlist] using file.csv, comma nolabel
hat tip to:
http://www.ats.ucla.edu/stat/stata/faq/outsheet.htm