Ingest CSV file times in BQ

In tests ingesting files directly from GCS into BigQuery, we get much better performance than with streaming inserts. However, the performance also fluctuates much more.
For example, we tested loading a large CSV into BigQuery (10M rows, 2 GB): it loaded in 2.275 min the first time but took ~8 minutes the second time. Why is there such a fluctuation in the import times?
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
Update: This turned out to be a change in a threshold value:
It turned out to depend on the MaxError property. The time the CSV imported in 2 min was when MaxError was too low and some errors (like too-long fields) prevented the CSV file from being parsed fully. I have since raised MaxError to 1000.
Tried a couple of times, and it takes 7-8 minutes to complete the parsing with this threshold value set.
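For reference, this threshold is exposed as maxBadRecords in the BigQuery load job configuration (max_bad_records in the Python client). A minimal sketch of a GCS-to-BigQuery load with a raised threshold, where the bucket, dataset, and table names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        # Counterpart of the MaxError setting discussed above: with a low
        # value the job can stop early on bad rows; with a higher value it
        # parses the whole file.
        max_bad_records=1000,
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/large-file.csv",      # placeholder URI
        "my_project.my_dataset.my_table",     # placeholder table
        job_config=job_config,
    )
    load_job.result()  # wait for completion
    print(f"Loaded {load_job.output_rows} rows.")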

Load is basically a query on federated data sources, with the results saved to the destination table. Performance of a query is dependent on the load of the backend system. Felipe explains this well in BigQuery Performance.
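To illustrate that point, here is a rough sketch of the equivalent federated query using the Python client: the CSV in GCS is declared as a temporary external table and the query result is written to a destination table. All bucket, dataset, and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Describe the GCS CSV as an external (federated) data source.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/large-file.csv"]
    external_config.autodetect = True

    job_config = bigquery.QueryJobConfig()
    job_config.table_definitions = {"csv_source": external_config}
    job_config.destination = bigquery.TableReference.from_string(
        "my_project.my_dataset.my_table"
    )
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

    # A load job behaves much like this: select from the federated source
    # and persist the result into the destination table.
    query_job = client.query("SELECT * FROM csv_source", job_config=job_config)
    query_job.result()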

Related

Google BigQuery streaming - time to insert

I am working with Google BigQuery for the first time on a client project and have created packages in SSIS to insert data into tables (an odd combination but one required by my client), using an SSIS plugin (CData).
I am looking to insert around 100k rows into a BigQuery table, however, when I look to do further update queries on this table, these cannot be performed because the data is still in the buffer. How does one know how long this will take in BigQuery and are there ways to speed up the process?
It doesn't matter if the data is still in the buffer. If you query the table, the data in the buffer will be included too. Just one of the many awesome things about BigQuery.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert
A record that arrives in the streaming buffer will remain there for some minimum amount of time (minutes). During this period while the record is buffered, it's possible that you may issue a query that will reference the table. The Instant Availability Reader allows workers from the query engine to read the buffered records prior to being committed to managed storage.
data is still in the buffer. How does one know how long this will take in BigQuery?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table.
Data can take up to 90 minutes to become available for copy and export operations. See more in the documentation.
Meanwhile, keep in mind that tables that have been written to recently via BigQuery streaming (tabledata.insertAll) cannot be modified using UPDATE or DELETE statements. So, as stated above, it can take up to 90 minutes.
are there ways to speed up the process?
The only way in your case is to load the data instead of streaming it. As I understand your case, the data is in MS SQL Server, so you could potentially make your SSIS package batch-aware and load it batch by batch through Cloud Storage.
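A rough sketch of what one such batch could look like once the SSIS package has written it out as a CSV file; the bucket, dataset, and table names are placeholders:

    from google.cloud import bigquery, storage

    def load_batch(csv_path, batch_name):
        # 1. Upload the batch file produced by the SSIS package to GCS.
        storage_client = storage.Client()
        bucket = storage_client.bucket("my-staging-bucket")
        blob = bucket.blob(f"batches/{batch_name}.csv")
        blob.upload_from_filename(csv_path)

        # 2. Load it into BigQuery with a load job instead of streaming inserts.
        bq_client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        job = bq_client.load_table_from_uri(
            f"gs://my-staging-bucket/batches/{batch_name}.csv",
            "my_project.my_dataset.my_table",
            job_config=job_config,
        )
        job.result()  # loaded rows are not subject to the streaming-buffer DML restriction

    load_batch("C:/exports/batch_001.csv", "batch_001")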

How to increase data processing speed in CSV data into SQL Server?

I have used NiFi 0.6.1 with a combination of the GetFile + SplitText + ReplaceText processors to split CSV data that is 30 MB (300,000 rows).
GetFile is able to pass the 30 MB to SplitText very quickly.
SplitText + ReplaceText then take 25 minutes to split the data into JSON.
Just 30 MB of data is taking 25 minutes to get the CSV into SQL Server.
It performs the conversion byte by byte.
I have tried the Concurrent Tasks option on the processors. It helps with speed, but it still takes a lot of time, and at that point CPU usage reaches 100%.
How can I get the CSV data into SQL Server faster?
Your incoming CSV file has ~300,000 rows? You might try using multiple SplitText processors to break that down in stages. One big split can be very taxing on system resources, but dividing it into multiple stages can smooth out your flow. The typically recommended maximum is between 1,000 and 10,000 lines per split.
See this answer for more details.
You mention splitting the data into JSON, but you're using SplitText and ReplaceText. What does your incoming data look like? Are you trying to convert to JSON to use ConvertJSONtoSQL?
If you have CSV incoming, and you know the columns, SplitText should pretty quickly split the lines, and ReplaceText can be used to create an INSERT statement for use by PutSQL.
Alternatively, as @Tomalak mentioned, you could try to put the CSV file somewhere where SQL Server can access it, then use PutSQL to issue a BULK INSERT statement.
If neither of these is sufficient, you could use ExecuteScript to perform the split, column parsing, and translation to SQL statement(s).
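If you go the scripting route, the core of that work is just line splitting, column parsing, and a parameterized insert. A standalone Python sketch of that logic (outside NiFi), with the connection string, table, and column names made up for illustration:

    import csv
    import pyodbc

    # Hypothetical connection string, table, and column names.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # send parameter batches instead of row-by-row round trips

    INSERT_SQL = "INSERT INTO my_table (col1, col2, col3) VALUES (?, ?, ?)"

    def flush(batch):
        if batch:
            cursor.executemany(INSERT_SQL, batch)
            conn.commit()

    with open("input.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == 10000:  # same 1,000-10,000 guideline as the splits
                flush(batch)
                batch = []
        flush(batch)  # remaining rows

    conn.close()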

Neo4j batch insertion with .CSV files taking huge amount of time to sort&index

I'm trying to create a database with data collected from Google n-grams. It's actually a lot of data, but after the creation of the CSV files the insertion was pretty fast. The problem is that, immediately after the insertion, the neo4j-import tool indexes the data, and this step is taking too much time. It's been more than an hour and it looks like it has reached about 10% progress.
Nodes
[*>:9.85 MB/s---------------|PROPERTIES(2)====|NODE:198.36 MB--|LABE|v:22.63 MB/s-------------] 25M
Done in 4m 54s 828ms
Prepare node index
[*SORT:295.94 MB-------------------------------------------------------------------------------] 26M
This is the console info atm. Does anyone have a suggestion about what to do to speed up this process?
Thank you. (:
Indexing takes a long time depending on the number of nodes. I tried indexing with 10 million nodes and it took around 35 minutes, but you can still try these settings:
Increase your page cache size, which is set in the '/var/lib/neo4j/conf/neo4j.properties' file (on my Ubuntu system). Edit the following line:
dbms.pagecache.memory=4g
Allocate the size according to your RAM; here, 4g means 4 GB of space. You can also try changing the Java memory size, which is set in neo4j-wrapper.conf:
wrapper.java.initmemory=1024
wrapper.java.maxmemory=1024
You can also read neo4j documentation on this - http://neo4j.com/docs/stable/configuration-io-examples.html

Optimizations for MySQL Data Import

So far every time I have done a data load, it was almost always on enterprise grade hardware and so I guess I never realized 3.4 million records is big. Now to the question...
I have a local MySQL server on Windows 7, 64 bit, 4 GB RAM machine. I am importing a csv through standard 'Import into Table' functionality that is shipped with the Developer package.
My data has around 3,422,000 rows and 18 columns: 3 columns of type double and the rest all text. The size of the CSV file is about 500 MB.
Both the data source (CSV) and the destination (MySQL) are on the same machine, so I guess there is no network bottleneck.
It took almost 7 hours to load 200,000 records. At this speed it might take me 4 days to load the entire dataset. Given the widespread popularity of MySQL, I think there has to be a better way to make the data load faster.
I have data only in CSV format and the rudimentary way I can think of is to split it into different blocks and try loading it.
Can you please suggest what optimizations I can make to speed this up?
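For what it's worth, the splitting mentioned in the question is straightforward to do outside MySQL; a small sketch that writes the CSV out in fixed-size chunks (file names and chunk size are arbitrary):

    import csv

    CHUNK_ROWS = 200_000  # arbitrary block size

    def write_chunk(header, rows, index):
        # Each chunk gets its own header so it can be imported independently.
        with open(f"data_part_{index:03d}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)

    with open("data.csv", newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        index, rows = 0, []
        for row in reader:
            rows.append(row)
            if len(rows) == CHUNK_ROWS:
                write_chunk(header, rows, index)
                index, rows = index + 1, []
        if rows:
            write_chunk(header, rows, index)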

Load very large CSV into Neo4j

I want to load a set of large rdf triple files into Neo4j. I have already written a map-reduce code to read all input n-triples and output two CSV files: nodes.csv (7GB - 90 million rows) and relationships.csv (15GB - 120 million rows).
I tried the batch-import command from Neo4j v2.2.0-M01, but it crashes after loading around 30M rows of nodes. I have 16 GB of RAM on my machine, so I set wrapper.java.initmemory=4096 and wrapper.java.maxmemory=13000. So I decided to split nodes.csv and relationships.csv into smaller parts and run batch-import for each part. However, I don't know how to merge the databases created by the multiple imports.
I appreciate any suggestion on how to load large CSV files into Neo4j.
I was finally able to load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was that some values contained \", which was interpreted as an escaped quotation character to be included in the field value, and this broke the parsing of everything from that point forward.
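If you hit the same problem, one pragmatic workaround is to pre-clean the files before handing them to the import tool. A small sketch that rewrites the offending \" sequences (what to replace them with depends on your data; doubling the quote is just one option):

    def clean_csv(src_path, dst_path):
        # Rewrites backslash-escaped quotes so the importer's quote handling
        # isn't thrown off. Adjust the replacement to your data.
        with open(src_path, "r", encoding="utf-8") as src, \
             open(dst_path, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(line.replace('\\"', '""'))

    clean_csv("nodes.csv", "nodes_clean.csv")
    clean_csv("relationships.csv", "relationships_clean.csv")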
Why don't you try this approach (using Groovy): http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
You will create a uniqueness constraint on the nodes, so duplicates won't be created.
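For completeness, the uniqueness constraint itself is a single Cypher statement; a sketch using the official Neo4j Python driver, where the URI, credentials, label, and property name are placeholders and the exact constraint syntax varies between Neo4j versions:

    from neo4j import GraphDatabase

    # Placeholder URI, credentials, label, and property name. The constraint
    # syntax below is the Neo4j 2.x/3.x form; newer releases use
    # CREATE CONSTRAINT ... FOR (n:Node) REQUIRE n.id IS UNIQUE.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run("CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE")

    driver.close()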