get data from .csv file, analyze, produce output - python3 - csv

I am trying to complete an assignment in Python 3. It is very similar to the PDF found here.
I have a few questions about how to get the information I need and, if possible, some code that could move me along. I am new to Python. Right now the code I have keeps raising a "No such file or directory" error when I run the function that reads the data. I know the .csv file should be in the directory where I save it in WingIDE, but I can't get it to work correctly.
My first question: after reading each line of the .csv file with my get_data_list function, what is the best way to take each category and feed it into an efficiency equation?
Here is my get_data_list function:
def get_data_list(filename):
    data_file = open(filename, "r")
    data_list = []
    for line_str in data_file:
        data_list.append(line_str.strip().split(','))
    return data_list
When I run get_data_list("player_regular_season.csv") I get the following error:
builtins.IOError: [Errno 2] No such file or directory: 'player_regular_season.csv'

As a first try, you can put the data file in the same directory as the Python program and launch the program from that directory.
Also try a single-purpose script to learn how to work with directories. Learn the functions from the standard documentation, 15.1.5. Files and Directories, namely os.getcwd() and os.chdir(path), and then 10.1. os.path — Common pathname manipulations, namely os.path.isfile(path).
But also read the documentation of the other functions there to learn what is available.
Once you know how to work with filenames and paths, have a look at 13.1. csv — CSV File Reading and Writing. To avoid being overwhelmed by all the material, start from the end: 13.1.5. Examples of using the csv module.
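Putting those pieces together, a minimal sketch might look like the following; the column indices and the efficiency formula are placeholders, since the assignment's actual formula isn't shown here:

import csv
import os

# Confirm where Python is actually running from and whether the file is visible there.
print(os.getcwd())
print(os.path.isfile("player_regular_season.csv"))

def get_data_list(filename):
    # The csv module handles the comma splitting (and quoted fields) for you.
    with open(filename, "r", newline="") as data_file:
        return [row for row in csv.reader(data_file)]

# Hypothetical use of one row: the column indices and the formula below are
# placeholders; substitute the ones your assignment specifies.
for row in get_data_list("player_regular_season.csv")[1:]:   # [1:] skips a header row, if the file has one
    points = float(row[3])                                    # placeholder column index
    games = float(row[4])                                     # placeholder column index
    efficiency = points / games if games else 0.0
    print(row[0], efficiency)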

Related

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. The setup: a third-party tool copies one JSON file per object to a blob storage container, and these feed into a Mapping Data Flow. The individual files written by the third-party tool work great. If I copy these files to a different blob folder using an Azure Copy Data activity, the MDF can no longer parse the files and gives the error: "JSON parsing error, unsupported encoding or multiline." I started this with Merge Files, but the outcome is the same regardless of which copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy Activity's Merge Files from JSON to JSON definitely adds an EOL character to each single JSON object as it gets imported into the merged file. I have also found that the MDF definitely fails with those EOL characters in the merged file; if I remove all EOL characters, the same MDF works. For me, this is a bug: the copy activity is adding a character that breaks the MDF. There also seems to be a second issue in some of my data that doesn't fail as an individual file but does break the MDF when concatenated with the others, but I have tested the basic behavior on 1-5000 files and been able to repeat the fail/success results.
I took the original file and the copied file and ran them through all sorts of tests. Here is what I eventually found when I dumped them into Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- On further review, it seems the minification/whitespace removal may be the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF change masked something else. I'm not sure what to do at this point, but it's super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work: the data preview returns the same success or failure as the pipeline when run.
In the case of multiple documents merged by ADF, I get the parsing error described above instead.
Repro: Create any valid JSON as a single file, put it in blob storage, and use it as a source in a mapping data flow with any sink operation. Create a second file with the same schema and get them both to run in the same flow using wildcard paths. Then use a Copy Activity with Merge Files as the sink copy behavior and Array of Objects as the file pattern, and try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from the standard VS Code JSON extension and the VS 2019 "Unminify" command) and re-upload... It should work now.
I don't know if you have already solved the problem; I came across the exact same problem 3 days ago and after several tries found a solution:
In the Copy Data activity, under sink settings, use "Set of objects" (instead of "Array of objects") for the File pattern, so that the merged big JSON has the original small JSON documents written one per line.
In the MDF, after setting up the wildcard paths with the *.json pattern, under JSON settings select "Document per line" as the Document form.
After that you should be good to go; at least it solved my problem. The automatically written CRLF in the "Array of objects" setting of the Copy Data activity appears to be the default, and MSFT should provide an option to omit it in the settings in the future.
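If you already have files that were merged with "Array of objects" and just need to repair them, a small Python sketch of the same idea (the file names here are placeholders) rewrites the array as one document per line with LF-only endings, which is the shape the "Document per line" setting expects:

import json

merged_array_path = "merged_array.json"    # placeholder: output of the Copy Activity with "Array of objects"
per_line_path = "merged_per_line.json"     # placeholder: the "Set of objects" / document-per-line shape

# Load the merged JSON array, then write one minified document per line,
# forcing "\n" endings so no CRLF sneaks in.
with open(merged_array_path, "r", encoding="utf-8") as src:
    documents = json.load(src)

with open(per_line_path, "w", encoding="utf-8", newline="\n") as dst:
    for doc in documents:
        dst.write(json.dumps(doc, separators=(",", ":")) + "\n")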
According to my test:
1. The Copy Data activity can't change Unix (LF) line endings to Windows (CRLF).
2. MDF can parse both Unix (LF) and Windows (CRLF) files.
Maybe there is something else wrong.
By the way, I see there is a trailing comma after "name":"Customer Name" in your original file; I deleted it before my test.

Managing a large SPSS (*.sav) file (4.2 GB)

I have received an SPSS file from a survey fielded by another company that allegedly contains only ~1500 respondents, but the file size has somehow ballooned to 4.2 GB. My hunch is that the file came from a global survey and the 1500 selected records are US-only, so the file carries a series of blank variables, plus metadata for those variables, possibly in multiple languages/alphabets.
I only need a subset of this data and could likely work with it if I removed the metadata, but my issue is that I can't get the damn thing open to cut down on the number of variables. I have been using the tools at my disposal to try the following workarounds, though I'm sure there are better options:
Opening the file using PSPP (freeware SPSS) - this causes PSPP to stop responding
Using the R command read.spss (from the foreign package) to write a .csv - this claims that the file has a duplicate variable name and won't proceed further
Using the R command spss.system.file to write a .csv - when I tried this, R spent a long time thinking as it attempted to run and kept going for a couple of hours with no apparent success.
Using the PSPP text conversion tool (https://pspp.benpfaff.org/) to create either a dictionary or a .csv file - both of these options crash after the file has completed uploading.
I've gone back to the other company to try to have them reduce the file size; however, I wasn't sure if anyone else had ideas for either of the following:
Open the file using another program/converter that could turn it into a .csv or other similarly skinny file format
Use another program to at least read only the variable names included in the file so that I can provide the other company with the specific variables I need
The following command from PSPP should do what you need:
$ pspp-convert originalFile.sav output.csv
In case it doesn't, please provide the terminal error message.
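If the PSPP route fails, a Python sketch using the third-party pyreadstat package (an assumption on my part, not something the question mentions) can read just the file dictionary, which should be quick even for a 4.2 GB file, and list the variable names to send back to the other company:

import pyreadstat

# metadataonly=True reads only the dictionary, not the 4.2 GB of case data.
# The file name is a placeholder.
df, meta = pyreadstat.read_sav("survey.sav", metadataonly=True)

print(len(meta.column_names), "variables")
for name, label in zip(meta.column_names, meta.column_labels):
    print(name, "-", label)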

LOAD CSV command keeps using old file: location, ignores command input

I am using Community edition 3.0.5 on Windows 10. I made multiple efforts to execute a LOAD CSV command before being told that such files cannot reside on an external drive. When I moved the file to users/user/ and tried to execute the LOAD CSV command, I got the same message "Couldn't load the external resource at: file:/F:/Neo4j%20DBs/Data.gov%20Consumer%20Complaints/Consumer%20Complaints%20DB/import/Users/CharlieOh/Consumer_Complaints.csv" in spite of the fact that the command I entered was:
"LOAD CSV WITH HEADERS FROM
'file:///Users/CharlieOh/Consumer_Complaints.csv' AS line
WITH line
LIMIT 1
RETURN line"
I tried to locate the file neo4j.conf and could only find C:\Program Files (x86)\Neo4j Community 3.2.2\Neo4j Community.install4j\i4jparams.conf. I even deleted the old DB and recreated the small amount of data and got the same error, which seems to indicate that the LOAD CSV function is totally useless across all my neo4j databases. BTW, the %20 in the file specification was due to suggestions on Stack Overflow, as was using underscores to avoid any blank spaces in the file specification. None of it worked, and now that I believe I may have solved the problem by putting the csv file in the user directory, the LOAD CSV function still won't let me do it. One last thing: I am following the YouTube video https://www.youtube.com/watch?v=Eh_79goBRUk to learn how to load a csv file into neo4j.
The csv file needs to go in the import directory of the specific database. With Neo4j Desktop this is easy to identify by clicking on the Manage button of the database and then the open folder button. It looks like you've found it.
Once the database import directory is located, you reference the file in LOAD CSV with the statement LOAD CSV WITH HEADERS FROM 'file:///" + FN + "', where FN is your file name, including the .csv extension. You do NOT use the full path; that is assumed.
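If you happen to be driving this from Python rather than the browser, a rough sketch of the same statement built with just the bare file name (the driver package, URI, and credentials below are assumptions for illustration, not part of the original answer):

from neo4j import GraphDatabase

# Placeholders: adjust the URI, credentials, and file name for your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
fn = "Consumer_Complaints.csv"   # the bare file name inside the import directory

query = "LOAD CSV WITH HEADERS FROM 'file:///" + fn + "' AS line RETURN line LIMIT 1"
with driver.session() as session:
    for record in session.run(query):
        print(record["line"])
driver.close()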

Spark S3 CSV read returns org.apache.hadoop.mapred.InvalidInputException

I see several posts here and in a Google search for org.apache.hadoop.mapred.InvalidInputException
but most deal with HDFS files or trapping errors. My issue is that while I can read a CSV file from spark-shell, running it from a compiled JAR constantly returns an org.apache.hadoop.mapred.InvalidInputException error.
The rough process of the jar:
1. read from JSON documents in S3 (this works)
2. read from parquet files in S3 (this also succeeds)
3. write a result of a query against #1 and #2 to a parquet file in S3 (also succeeds)
4. read a configuration csv file from the same bucket #3 is written to. (this fails)
These are the various approaches that I have tried in code:
1. val osRDD = spark.read.option("header","true").csv("s3://bucket/path/")
2. val osRDD = spark.read.format("com.databricks.spark.csv").option("header", "true").load("s3://bucket/path/")
All variations of the two above with s3, s3a and s3n prefixes work fine from the REPL but inside a JAR they return this:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/path/eventsByOS.csv
So, it found the file but can't read it.
Thinking this was a permissions issue, I have tried:
a. export AWS_ACCESS_KEY_ID=<access key> and export AWS_SECRET_ACCESS_KEY=<secret> from the Linux prompt. With Spark 2 this has been sufficient to provide us access to the S3 folders up until now.
b. .config("fs.s3.access.key", <access>)
.config("fs.s3.secret.key", <secret>)
.config("fs.s3n.access.key", <access>)
.config("fs.s3n.secret.key", <secret>)
.config("fs.s3a.access.key", <access>)
.config("fs.s3a.secret.key", <secret>)
Before this failure, the code reads from parquet files located in the same bucket and writes parquet files to the same bucket. The CSV file is only 4.8 KB in size.
Any ideas why this is failing?
Thanks!
Adding stack trace:
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:253)
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1332)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.take(RDD.scala:1326)
org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1367)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.first(RDD.scala:1366)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:206)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
scala.Option.orElse(Option.scala:289)
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
Nothing springs out when I paste that stack into the IDE, but I'm looking at a later version of Hadoop and can't currently switch to older ones.
Have a look at these instructions.
That landsat gz file is actually a CSV file you can try to read in; it's the one we generally use for testing because it's there and free to use. Start by seeing if you can work with it.
If using Spark 2.0, use Spark's own CSV package.
Do use S3a, not the others.
I solved this problem by adding the specific Hadoop configuration for the appropriate method (s3 in the example here). The odd thing is that the above security works for everything in Spark 2.0 EXCEPT reading the CSV.
This code solved my problem using S3.
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", p.aws_accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey",p.aws_secretKey)
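For anyone doing the same from PySpark, a rough equivalent of the Scala fix above (the "spark.hadoop." prefix passes the same fs.s3 properties through to the Hadoop configuration; the keys are placeholders):

from pyspark.sql import SparkSession

aws_access_key = "<access key>"   # placeholder
aws_secret_key = "<secret>"       # placeholder

spark = (SparkSession.builder
         .appName("csv-from-s3")
         .config("spark.hadoop.fs.s3.awsAccessKeyId", aws_access_key)
         .config("spark.hadoop.fs.s3.awsSecretAccessKey", aws_secret_key)
         .getOrCreate())

df = spark.read.option("header", "true").csv("s3://bucket/path/eventsByOS.csv")
df.show(5)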

I get a mysterious "Neo.ClientError.Statement.InvalidSyntax" error when loading a CSV in Neo4j

For a course on Excel I was trying to load a CSV in Neo4j (first time using this application) when I was blocked at the first step of replicating an example shown in said course: loading.
The command used in the example was this:
LOAD CSV WITH HEADERS FROM "file:/path/to/file/file.csv"
as row
CREATE (m:movie {name:row.movie})
But it gave syntax errors. I found out I could correct it by using double backslashes and adding "file:":
LOAD CSV WITH HEADERS FROM "file://C:\\path\\to\\file\\file.csv"
as row
CREATE (m:movie {name:row.movie})
Neo4j accepts this syntax, processes for a few moments, and returns YET ANOTHER error:
Neo.TransientError.Statement.ExternalResourceFailure
I tried the same commands (the original and my own) in the online Neo4j console, but no luck. I can reach the file using that path without problem; it really is there. The CSV file consists of just 5 strings of regular letters, that's all. No fancy formatting or characters.
What's going on?
Not that mysterious: Neo4j's LOAD CSV function looks for the specified CSV file in the import directory for that database, as specified in its server configuration file (i.e. dbms.directories.import=import in your neo4j.conf file).
You should create the import directory in...
"C:\Users\[User Name]\Documents\Neo4j\default.graphdb\"
If you place your CSV file in there, you can specify any sub-directory or just the "file.csv" you want to import with the LOAD CSV function as below.
LOAD CSV WITH HEADERS FROM "file:///file.csv"
AS row
RETURN row
LIMIT 5
Try using:
"file:///C:/path/to/file/file.csv"
Since your file is on your local computer, the third / following the file scheme is not preceded by a host name or address -- but it still needs to be there. Also, file URI path separators should be forward slashes (even on Windows machines).
See the File URI scheme Wikipedia page if you need more information.
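As a side note, Python's pathlib can build that URI form for you on a Windows machine, which makes it easy to see where the slashes go (a small illustration, not something the answer above requires):

from pathlib import Path

# On Windows this prints: file:///C:/path/to/file/file.csv
# (backslashes become forward slashes, and the third slash is still there).
print(Path(r"C:\path\to\file\file.csv").as_uri())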