How do I read a file from a FileSystem with pyarrow.csv.read_csv? - csv

I want to read a single CSV file in a google bucket with pyarrow. How do I do this?
I can create a FileSystem object with gcsfs, but I don't see a way to provide this to pyarrow.csv.read_csv.
Do I need to create some sort of file stream from the file system? What's the best way to do this?
import gcsfs
import pyarrow.csv as csv
fs = gcsfs.GCSFileSystem(project='foo')
csv.read_csv("bucket/foo/bar.csv", filesystem=fs)
TypeError: read_csv() got an unexpected keyword argument 'filesystem'
Using pyarrow version 6.0.1

I'm guessing you are working from this doc. You're correct that the approach listed there does not work with read_csv, because read_csv has no filesystem parameter. We can still generally do this, but the process is a bit different.
Pyarrow has its own filesystem abstraction. If you have a pyarrow filesystem then you can first open a file and then use that file to read the CSV:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.fs as fs
local_fs = fs.LocalFileSystem()
with local_fs.open_input_file('foo/bar.csv') as csv_file:
    csv.read_csv(csv_file)
Unfortunately, a gcsfs.GCSFileSystem is not a "pyarrow filesystem" but you have a few options.
The method gcsfs.GCSFileSystem.open can give you a "python file object" which you can use as input to pyarrow.csv.read_csv.
import gcsfs
import pyarrow.csv as csv
fs = gcsfs.GCSFileSystem(project='foo')
with fs.open("bucket/foo/bar.csv", 'rb') as csv_file:
    csv.read_csv(csv_file)
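Another option, since pyarrow 6.0.1 ships pyarrow.fs.PyFileSystem and FSSpecHandler, is to wrap the gcsfs filesystem so it behaves like a pyarrow filesystem. A minimal sketch, using the same hypothetical bucket path as above:
import gcsfs
import pyarrow.csv as csv
import pyarrow.fs as fs

# Wrap the fsspec-compatible gcsfs filesystem in a pyarrow filesystem
gcs = gcsfs.GCSFileSystem(project='foo')  # 'foo' is the placeholder project from the question
pa_fs = fs.PyFileSystem(fs.FSSpecHandler(gcs))

# open_input_file now works just like with a native pyarrow filesystem
with pa_fs.open_input_file("bucket/foo/bar.csv") as csv_file:
    table = csv.read_csv(csv_file)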


Why is pyspark unable to read this csv file?

I was unable to find this problem among the numerous similar Stack Overflow questions of the form "how to read a csv into a pyspark dataframe?" (see the list of similar-sounding but different questions at the end).
The CSV file in question resides in the /tmp directory of the cluster's driver. Note that this CSV file is intentionally NOT in Databricks DBFS cloud storage; using DBFS will not work for the use case that led to this question.
Note that I am trying to get this working on Databricks runtime 10.3 with Spark 3.2.1 and Scala 2.12.
import csv

y_header = ['fruit','color','size','note']
y = [('apple','red','medium','juicy')]
y.append(('grape','purple','small','fresh'))

with open('/tmp/test.csv','w') as f:
    w = csv.writer(f)
    w.writerow(y_header)
    w.writerows(y)
Then use the Python os module to verify the file was created:
import os
list(filter(lambda f: f == 'test.csv',os.listdir('/tmp/')))
Now verify that the Databricks dbutils API can see the file (you have to use the file:/// prefix):
dbutils.fs.ls('file:///tmp/test.csv')
Now, as an optional step, specify a dataframe schema for Spark to apply to the CSV file:
from pyspark.sql.types import *
csv_schema = StructType([StructField('fruit', StringType()), StructField('color', StringType()), StructField('size', StringType()), StructField('note', StringType())])
Now define the PySpark dataframe:
x = spark.read.csv('file:///tmp/test.csv',header=True,schema=csv_schema)
The above line runs with no errors, but remember that, due to lazy execution, the Spark engine still has not read the file. So next we will give Spark a command that forces it to evaluate the dataframe:
display(x)
And the error is:
FileReadException: Error while reading file file:/tmp/test.csv. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
Caused by: FileNotFoundException: File file:/tmp/test.csv does not exist. . .
Digging into the error I found this: java.io.FileNotFoundException: File file:/tmp/test.csv does not exist. I already tried restarting the cluster; the restart did not clear the error.
But I can prove the file does exist; for some reason Spark and Java are simply unable to access it, because I can read the same file with pandas with no problem:
import pandas as p
p.read_csv('/tmp/test.csv')
So how do I get spark to read this csv file?
appendix - list of similar spark read csv questions I searched through that did not answer my question: 1 2 3 4 5 6 7 8
I guess the Databricks file loader doesn't recognize the absolute path /tmp/ on the driver.
You can try the following workaround:
Read the file from its local path into a pandas DataFrame.
Pass the pandas DataFrame to Spark using the createDataFrame function.
Code :
import pandas as pd

df_pd = pd.read_csv('/tmp/test.csv')    # read the driver-local file with pandas
sparkDF = spark.createDataFrame(df_pd)  # convert the pandas DataFrame to a Spark DataFrame
sparkDF.display()
I made email contact with a Databricks architect, who confirmed that Databricks can only read locally (from the cluster) in a single-node setup.
So DBFS is the only option for random reading/writing of text data files in a typical cluster that contains more than one node.
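If copying the data into DBFS is acceptable after all, one workaround sketch (assuming the /tmp/test.csv file and csv_schema from the question above) is to copy the driver-local file into DBFS first, so that every node can reach it, and read it from there:
# Copy the driver-local file into DBFS so all nodes can access it
dbutils.fs.cp('file:///tmp/test.csv', 'dbfs:/tmp/test.csv')

# Read it back through the DBFS path
x = spark.read.csv('dbfs:/tmp/test.csv', header=True, schema=csv_schema)
display(x)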

How to import data from json file to mongodb atlas collection

I wanted to import data into my collection in MongoDB Atlas, and I was following the documentation: https://docs.mongodb.com/compass/beta/import-export/ but there is no "ADD DATA" button and I don't know whether I'm using some other version or doing something else wrong.
I need to import the whole file, which is a JSON array.
The docs you referenced are for a future version of Compass. If you want to import from EJSON at the command line you can use mongoimport.
Here's the simplest syntax, but there are many variations possible.
mongoimport --db=users --collection=contacts --file=contacts.json
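Since your file is a single JSON array, you will likely also need the --jsonArray flag, and for Atlas you can point mongoimport at your cluster with a connection string. A sketch, with placeholder Atlas credentials:
# Replace <user>, <password> and <cluster> with your Atlas connection details
mongoimport --uri "mongodb+srv://<user>:<password>@<cluster>/users" --collection=contacts --jsonArray --file=contacts.json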

Looking for a way to write a csv file to mdf4 format?

I'm looking for a way to write/convert .csv data to MF4 format programmatically.
I have done it singularly using IPEmotion.
I have checked out https://www.turbolab.de/mdf_libf.htm and opened up their code listed on that site. If anyone has even used this before I would be grateful for advice. I primarily use LabVIEW, but am open to python/c++/C# solutions.
You could load the CSV using pandas and append it to an MDF object using this library: https://asammdf.readthedocs.io/en/latest/api.html#mdf4 (see the append method).
from asammdf import MDF
import pandas as pd

df = pd.read_csv('input.csv')  # load the CSV into a pandas DataFrame
mdf = MDF()                    # create an empty MDF object (version 4 by default)
mdf.append(df)                 # append the DataFrame as a new channel group
mdf.save('output.mf4')         # write the MF4 file to disk
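To sanity-check the result, you can reopen the generated file and convert it back to a DataFrame (a quick check, assuming the output.mf4 written above):
from asammdf import MDF

# Reopen the generated MF4 file and convert it back to a pandas DataFrame
check = MDF('output.mf4').to_dataframe()
print(check.head())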

Is there a way to import csv file from Cassandra DevCenter?

One way to import a CSV file is by using the COPY command in cqlsh. I am wondering whether there is an effective way to import a CSV file from DevCenter?
Sorry, there currently is no way to import from CSV using DevCenter.

How to load jar dependenices in IPython Notebook

This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts such as this one describing how to use spark-csv.
But I am not able to initialize the IPython instance by including either the .jar file or the package extension in the start-up, the way it can be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
This property can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:
packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
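Once the package is available, reading a CSV through spark-csv looks roughly like this (a sketch assuming Spark 1.x, an existing SQLContext named sqlContext, and a hypothetical cars.csv file):
# Read a CSV via the spark-csv data source (Spark 1.x style)
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('cars.csv')
df.printSchema()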