I am trying to import a CSV into a column family in Cassandra using the following syntax:
copy data (id, time, vol, speed, occupancy, status, flags) from 'C:\Users\Foo\Documents\reallybig.csv' with header = true;
The CSV file is about 700 MB, and for some reason when I run this command in cqlsh I get the following error:
"Request did not complete within rpc_timeout."
What is going wrong? There are no errors in the CSV, and it seems to me that Cassandra should be able to ingest this CSV without a problem.
The Cassandra installation folder contains a .yaml config file with an RPC timeout setting, rpc_timeout_in_ms. You can increase that value and restart Cassandra.
Another way is to cut your big CSV into multiple files and import them one by one.
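A minimal sketch of that split in Python (the chunk size and output file names are assumptions), repeating the header row in each piece so COPY ... WITH HEADER = true still works on every chunk:

```python
import csv

def split_csv(src_path, rows_per_chunk, prefix="chunk"):
    """Split a large CSV into smaller files, repeating the header in each chunk."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # keep the header so every chunk is self-describing
        chunks = 0
        rows = []

        def flush():
            nonlocal chunks
            if rows:
                chunks += 1
                with open("%s_%d.csv" % (prefix, chunks), "w", newline="") as out:
                    w = csv.writer(out)
                    w.writerow(header)
                    w.writerows(rows)
                rows.clear()

        for row in reader:
            rows.append(row)
            if len(rows) >= rows_per_chunk:
                flush()
        flush()  # write any leftover rows
    return chunks
```

Each chunk can then be fed to cqlsh's COPY one at a time.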
This actually ended up being my own misinterpretation of COPY FROM: the CSV was about 17 million rows, and in this case the best option was to follow the bulk loader example and run sstableloader. However, the answer above would certainly work if I wanted to break the CSV into 17 smaller CSVs, which is an option.
Related
I am trying to use neo4j-admin import to populate a neo4j database with CSV input data. According to the documentation, escaping quotation marks with \" is not supported, but my input has these and other formatting anomalies, so neo4j-admin import unsurprisingly fails on the input CSV:
> neo4j-admin import --mode=csv --id-type=INTEGER \
> --high-io=true \
> --ignore-missing-nodes=true \
> --ignore-duplicate-nodes=true \
> --nodes:user="import/headers_users.csv,import/users.csv"
Neo4j version: 3.5.11
Importing the contents of these files into /data/databases/graph.db:
Nodes:
:user
/var/lib/neo4j/import/headers_users.csv
/var/lib/neo4j/import/users.csv
Available resources:
Total machine memory: 15.58 GB
Free machine memory: 598.36 MB
Max heap memory : 17.78 GB
Processors: 8
Configured max memory: -2120992358.00 B
High-IO: true
IMPORT FAILED in 97ms.
Data statistics is not available.
Peak memory usage: 0.00 B
Error in input data
Caused by:ERROR in input
data source: BufferedCharSeeker[source:/var/lib/neo4j/import/users.csv, position:91935, line:866]
in field: company:string:3
for header: [user_id:ID(user), login:string, company:string, created_at:string, type:string, fake:string, deleted:string, long:string, lat:string, country_code:string, state:string, city:string, location:string]
raw field value: yyeshua
original error: At /var/lib/neo4j/import/users.csv # position 91935 - there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Universidad Pedagógica Nacional \"F'
My question is whether it is possible to skip or ignore poorly formatted rows of the CSV file for which neo4j-admin import throws an error. No such option seems to be available in the docs. I understand that solutions exist using LOAD CSV, and that CSVs ought to be preprocessed prior to import. Note that I am able to import the CSV successfully when I fix the formatting issues.
Perhaps it's worth describing the differences between the bulk importer and LOAD CSV.
LOAD CSV does a transactional load of your data into the database - this means you get all of the ACID goodness, etc. The side effect of this is that it's not the fastest way to load data.
The bulk importer assumes that the data is in a database-ready format: that you've dealt with duplicates, done any processing you needed to get it into the right form, and so on. It will just pull the data in as-is and load it into the database as specified. This is not a transactional load, and because it assumes the data being loaded is already 'database ready', it is ingested very quickly indeed.
There are other options to import data, but generally if you need to do some sort of row skipping/correction on import, you don't really want to be doing it via the offline bulk importer. I would suggest you either do some form of pre-processing on your own CSV prior to using neo4j-admin import, or look at one of the other import options available where you can dictate how to handle any poorly formatted rows.
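As a sketch of that pre-processing, assuming one record per physical line (no embedded newlines inside fields), you could drop every row that fails strict CSV parsing or has the wrong column count before handing the file to neo4j-admin import (file names and column count here are illustrative):

```python
import csv

def filter_bad_rows(src_path, dst_path, expected_cols):
    """Copy src to dst, dropping lines that fail strict CSV parsing
    (e.g. characters after a closing quote) or have the wrong column count.
    Assumes one record per physical line."""
    kept = dropped = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            try:
                # strict=True makes the parser raise on malformed quoting
                row = next(csv.reader([line], strict=True))
            except csv.Error:
                dropped += 1
                continue
            if len(row) != expected_cols:
                dropped += 1
                continue
            writer.writerow(row)
            kept += 1
    return kept, dropped
```

Rows like 'Universidad Pedagógica Nacional \"F' (a quote inside a quoted field) would trigger csv.Error and be skipped rather than aborting the whole import.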
I'm using spark.read() to read a big JSON file on Databricks, and it failed with "the spark driver has stopped unexpectedly and is restarting" after a long time of running. I assumed it is because the file is too big, so I decided to split it with the command:
split -b 100m -a 1 test.json
This actually split my file into small pieces, and I can now read them on Databricks. But then I found that what I got back was a set of null values. I think that is because I split the file only by size, so some of the pieces are no longer valid JSON. For example, I might get something like this at the end of a file:
{"id":aefae3,......
Then it can't be read by spark.read.format("json"). So is there any way I can separate the JSON file into small pieces without breaking the JSON format?
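If the file is actually newline-delimited JSON (one complete object per line, which is what spark.read.json expects by default), you can split on line boundaries instead of byte boundaries so every piece stays valid. A minimal Python sketch (chunk size and output names are assumptions); the shell equivalent would be split -l:

```python
def split_jsonl(src_path, lines_per_chunk, prefix="part"):
    """Split a newline-delimited JSON file on record boundaries,
    so every chunk remains valid JSON Lines.
    Assumes one complete JSON object per line."""
    chunks = 0
    out = None
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                chunks += 1
                out = open("%s_%d.json" % (prefix, chunks), "w", encoding="utf-8")
            out.write(line)  # whole lines only, so no record is ever cut in half
    if out:
        out.close()
    return chunks
```

If instead the file is one giant multi-line JSON document, splitting by lines won't help either; it would need to be parsed and re-emitted as JSON Lines first.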
What I'm trying to import is a CSV file with phone calls, and represent it as phone numbers in nodes and each call as an arrow.
The file is separated by pipes.
I have tried a first version:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
return line
limit 5
and it worked as expected: one node (the outgoing number) is created for each row.
After that I could test what I've done with a simple
Match (a) return a
So the following step I've tried is creating the second node of the call (the receiver):
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
merge (b:line {number:COALESCE(line[2],"" )})
return line
limit 5
After I run this code I receive no answer (I'm using the browser GUI at localhost:7474/browser), and if I try to perform any query on this server I get no result either.
So again if I run
match (a) return a
nothing happens.
The only way I've found to bring it back to life is stopping the server and starting it again.
Any ideas?
It is possible that opening that big file twice causes the problem, because how large files are handled depends heavily on the operating system.
Anyway, if you accidentally run it without the 'limit 5' clause, this can happen, since you are then trying to load the 26 GB in a single transaction.
Since LOAD CSV is for medium sized datasets, I recommend two solutions:
- Using the neo4j-import tool, or
- Split the file into smaller parts, and use periodic commit to prevent out-of-memory situations and hangs, like this:
USING PERIODIC COMMIT 100000
LOAD CSV FROM ...
I am trying to create a DataFrame from a CSV source that is on S3 on an EMR Spark cluster, using the Databricks spark-csv package and the flights dataset:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://h2o-airlines-unpacked/allyears.csv')
df.first()
This does not terminate on a cluster of 4 m3.xlarge instances. I am looking for suggestions to create a DataFrame from a CSV file on S3 in PySpark. Alternatively, I have tried putting the file on HDFS and reading from HDFS as well, but that also does not terminate. The file is not overly large (12 GB).
For reading a well-behaved CSV file that is only 12 GB, you can copy it onto all of your workers and the driver machine, and then manually split on ",". This may not parse every RFC 4180 CSV, but it parsed what I had.
- Add at least 12 GB of extra disk space per worker when you requisition the cluster.
- Use a machine type that has at least 12 GB of RAM, such as c3.2xlarge. Go bigger if you don't intend to keep the cluster around idle and can afford the larger charges. Bigger machines mean less disk file copying to get started. I regularly see c3.8xlarge under $0.50/hour on the spot market.
- Copy the file to each of your workers, into the same directory on each worker. This should be a physically attached drive, i.e. different physical drives on each machine.
- Make sure you have the same file and directory on the driver machine as well.
import re

raw = sc.textFile("/data.csv")
print "Counted %d lines in /data.csv" % raw.count()
raw_fields = raw.first()

# this regular expression is for quoted fields, i.e. "23","38","blue",...
matchre = r'^"(.*)"$'
pmatchre = re.compile(matchre)

def uncsv_line(line):
    return [pmatchre.match(s).group(1) for s in line.split(',')]

fields = uncsv_line(raw_fields)

def raw_to_dict(raw_line):
    return dict(zip(fields, uncsv_line(raw_line)))

parsedData = (raw
              .map(raw_to_dict)
              .cache()
              )

print "Counted %d parsed lines" % parsedData.count()
parsedData will be a RDD of dicts, where the keys of the dicts are the CSV field names from the first row, and the values are the CSV values of the current row. If you don't have a header row in the CSV data, this may not be right for you, but it should be clear that you could override the code reading the first line here and set up the fields manually.
Note that this is not immediately useful for creating data frames or registering a spark SQL table. But for anything else, it is OK, and you can further extract and transform it into a better format if you need to dump it into spark SQL.
I use this on a 7 GB file with no issues, except that here I've removed some filter logic that detects valid data, which had the side effect of also removing the header from the parsed data. You might need to reimplement some filtering.
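To bridge the gap mentioned above, a small helper (the names here are illustrative, not from the original answer) can flatten each parsed dict into a fixed-order tuple, which is the shape sqlContext.createDataFrame accepts:

```python
def dict_to_row(record, fields):
    """Turn one parsed-CSV dict into a tuple in a fixed column order.
    Missing keys become None so every row has the same width."""
    return tuple(record.get(f) for f in fields)

# On the RDD from the answer above, the same step would be (not run here):
#   rows = parsedData.map(lambda d: dict_to_row(d, fields))
#   df = sqlContext.createDataFrame(rows, fields)
```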
My data set contains 1,300,000 observations with 56 columns. It is a .csv file and I'm trying to import it using PROC IMPORT. After importing I find that only 44 of the 56 columns are imported.
I tried increasing GUESSINGROWS but it is not helping.
P.S.: I'm using SAS 9.3.
If (and only in that case, as far as I am aware) you specify the file to load in a FILENAME statement, you have to set the LRECL option to a value that is large enough.
If you don't, the default is only 256, so if your CSV has lines longer than 256 characters, SAS will not read the full line.
See this link for more information (just search for lrecl): https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000308090.htm
If you have SAS Enterprise Guide (I think it's now included with all desktop licenses), try the import wizard. It's excellent, and it will generate code you can reuse with a little editing.
It will take a while to run because it reads your entire file before writing the import logic.