I am trying to load a large number of CSV files into Apache Drill using the CTAS command.
The query fails completely even if just one record in a million has an invalid timestamp value.
Is there any way to skip the invalid records and load the rest with the Apache Drill CTAS command?
Query Used:
create table RAW_LOADER as select time_report, record_count, TEXT_COL
from dfs.`/path/to/csv/`;
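As far as I know, Drill's CTAS has no built-in option to skip bad records, so one workaround is to pre-filter rows whose timestamp column does not match the expected layout, so the conversion never sees them. A rough sketch, assuming time_report arrives as text in 'YYYY-MM-DD hh:mm:ss' form (adjust the pattern and the cast to your actual data):

-- Sketch only: keep rows whose time_report looks like a timestamp, then cast.
create table RAW_LOADER as
select cast(time_report as timestamp) as time_report,
       record_count,
       TEXT_COL
from dfs.`/path/to/csv/`
where time_report like '____-__-__ __:__:__%';

Rows that fail the pattern are silently dropped instead of failing the whole CTAS, so it is worth counting them separately if the skipped records need to be audited.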
Related
Summary: I am trying to improve our daily transfer of data from BigQuery into a MongoDB cluster using Airflow, but I am getting a mongoimport error at the end.
We need to transfer 5-15 GB of data daily from BigQuery to MongoDB. This is not brand-new data; it replaces the previous day's data, which is outdated (our DB size mostly stays the same and does not grow by 15 GB per day). Our Airflow DAG used to use a PythonOperator to transfer data from BigQuery into MongoDB, which relied on a Python function that did everything (connect to BQ, connect to Mongo, query data into pandas, insert into Mongo, create Mongo indexes). It looked something like the function in this post.
This task takes the name of a BigQuery table as a parameter and, using Python, creates a table with the same name in our MongoDB cluster. However, it is slow: as_dataframe() is slow, and insert_many() is slow-ish as well. When this task is run for many big tables at once, we also get out-of-memory errors. To handle this, we've created a separate transfer_full_table_chunked function that also transfers from BigQuery to Mongo, but uses a for loop, querying and inserting small chunks at a time. This prevents the out-of-memory errors, but it results in 100 queries plus inserts per table, which seems like a lot for a single table transfer.
Per an answer from here with a few upvotes:
I would recommend writing data from BigQuery to a durable storage service like Cloud Storage then loading into MongoDB from there.
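(For reference, the same kind of export can also be expressed directly in BigQuery SQL with an EXPORT DATA statement; the sketch below only reuses the table and bucket names from the operator further down and is not the approach actually used here:)

-- Sketch: export the table to GCS as newline-delimited JSON files.
EXPORT DATA OPTIONS(
    uri='gs://my-bucket/this-table/conference-*.json',
    format='JSON',
    overwrite=true
) AS
SELECT * FROM `myproject.mydataset.mytable`;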
Following the suggestion, we've created this task:
transfer_this_table_to_gcs = BigQueryToCloudStorageOperator(
    task_id='transfer_this_table_to_gcs',
    source_project_dataset_table='myproject.mydataset.mytable',
    destination_cloud_storage_uris=['gs://my-bucket/this-table/conference-*.json'],
    export_format='JSON',
    bigquery_conn_id='bigquery_conn_id'
)
...which does successfully export our BigQuery table into GCS, automatically creating multiple files using the partition key we've set in BigQuery for the table. Here's a snippet of one of those files (this is newline delimited JSON I think):
However, when we try to run mongoimport --uri "mongodb+srv://user#cluster-cluster.dwxnd.gcp.mongodb.net/mydb" --collection new_mongo_collection --drop --file path/to/conference-00000000000.json
we get the error:
2020-11-13T11:53:02.709-0800 connected to: localhost
2020-11-13T11:53:02.866-0800 dropping: mydb.new_mongo_collection
2020-11-13T11:53:03.024-0800 Failed: error processing document #1: invalid character '_' looking for beginning of value
2020-11-13T11:53:03.025-0800 imported 0 documents
I see that the very first thing in the file is an _id field, whose leading underscore is probably the invalid character the error message is referring to. This _id is explicitly in there as MongoDB's unique identifier for the row. I'm not quite sure what to do about this, or whether it is an issue with the data format (NDJSON) or with our table specifically.
How can I transfer a large amount of data from BigQuery to MongoDB daily, in general?
I have created an SSIS job for inserting records from a CSV file into a SQL Server database.
If I run the job the first time, the records are inserted into the DB successfully, but if I run the job a second time, it stores the same records again (duplicates).
So if I run my job multiple times, the records are inserted multiple times.
Is there any way to avoid inserting duplicate records into the database?
Please use a Lookup transformation in SSIS to find a match in the existing records, and insert a record only when no match is found. Alternatively, you can load the new data into a staging area and use CDC (change data capture) or an Execute SQL Task to load only the unmatched rows into the target.
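For the staging-area option, the Execute SQL Task would run something along these lines (just a sketch; the table names StagingRecords/TargetRecords and the key column RecordId are placeholders for your own schema):

-- Insert only rows from staging that do not already exist in the target.
INSERT INTO dbo.TargetRecords (RecordId, Col1, Col2)
SELECT s.RecordId, s.Col1, s.Col2
FROM dbo.StagingRecords AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.TargetRecords AS t
    WHERE t.RecordId = s.RecordId
);

-- Clear the staging table so the next package run starts empty.
TRUNCATE TABLE dbo.StagingRecords;

With this pattern the package can be re-run safely, because a second run finds every RecordId already present and inserts nothing.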
My question: the data should be deleted from the MySQL DB only when it has been selected (exported) successfully. When the SELECT query fails because of the large amount of data, the shell script must not delete anything, because the selected data also references images stored on the system that have to be moved to another location and packed into a tar file.
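If the intent is "only delete rows that were exported successfully", a rough SQL sketch of the idea follows (the table name, columns, retention condition, and output path are all placeholders; the shell script should check the mysql exit code of the export step before running the DELETE):

-- Export the rows (including the image paths the script will tar up).
SELECT id, image_path, payload
INTO OUTFILE '/tmp/archive_batch.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM my_table
WHERE created_at < NOW() - INTERVAL 30 DAY;

-- Run this only after the export above succeeded.
DELETE FROM my_table
WHERE created_at < NOW() - INTERVAL 30 DAY;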
I have a Java application that uses the JDBC-ODBC bridge to connect to DBF files linked into a Microsoft Access database (so I'm using the driver for a connection to a Microsoft Access database).
There is a table named SALFAC that contains the fields NRO_FAC, COD_ITE, CAN_ITE, PRC_ITE and DSC_ITE, among other columns. When I run SELECT NRO_FAC, COD_ITE, CAN_ITE, PRC_ITE, DSC_ITE FROM SALFAC without a WHERE clause, it works fine. But when I execute SELECT NRO_FAC, COD_ITE, CAN_ITE, PRC_ITE, DSC_ITE FROM SALFAC WHERE NRO_FAC=151407, my program throws a SQLException with the message "The search key was not found in any record".
The NRO_FAC column is an integer type column, so using quotes results in a syntax error.
I compacted and repaired the entire database, to no avail. I also tested the query directly in Microsoft Access 2010 and it gave me the same error. Yesterday I tested with another JDBC-ODBC bridge connection to the DBF files directly, and it also gave me the same error with the same query.
There are no blank spaces in the table names or the column names.
Is there any additional step required to make queries like these work? I need to execute the query with the WHERE clause. Also, each DBF file has a corresponding NTX file. Must I do something with those files as well?
Thanks in advance
EDIT: I found something yesterday that might help. I changed the way I search the rows: I insert the entire DBF table content into an MS Access temporary table, row by row, and then execute the query against the temporary table. It inserted the first 9 rows correctly, but the 10th row was next to a row that is marked as deleted, and at that point the query crashed. Do "marked-as-deleted" rows affect a query in MS Access and/or dBase? If they do, is it possible to ignore the "marked-as-deleted" rows using the JDBC-ODBC bridge? Also, must I install the Clipper commands (like DBU or PACK) on the server (it doesn't have them)?
Is there any way that I can upload an XLSX file to my MySQL database automatically every 12 hours?
I have an XLSX file with around 600 rows. The target table already exists.
I would like to perform the following steps:
1. Delete the content of the existing table.
2. Insert the data from the XLSX file.
This should be performed every 12 hours. Is there a way to do this without using PHP?
Thanks in advance.
Yes. You can use LOAD DATA LOCAL INFILE, provided that the file is in CSV format; otherwise, convert the file to CSV first.
Delete the content of the existing table.
Before you do so, take a backup of the table. You can create an intermediary backup table and insert the data there.
Insert the data from the xlsx-file.
Use LOAD DATA INFILE to import the data.
This should be performed every 12 hours.
You can create a SQL script with all of these steps, then create a scheduled task (Windows) that runs it every 12 hours.
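The SQL script itself could look roughly like this (the table name, file path, and CSV layout are placeholders, and the XLSX has to be saved or converted to CSV first for LOAD DATA to read it):

-- Step 1: empty the existing table (take a backup first if needed).
TRUNCATE TABLE my_target_table;

-- Step 2: reload the ~600 rows from the exported CSV file.
LOAD DATA LOCAL INFILE 'C:/exports/data.csv'
INTO TABLE my_target_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

The scheduled task would then just invoke the mysql client with local-infile enabled, e.g. mysql --local-infile=1 -u youruser -p yourdb < reload.sql, every 12 hours.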
You can do it using the Data Import tool in dbForge Studio for MySQL (in command-line mode).
How to:
Create a data-import template file: open the Data Import wizard, select the target table, check the Repopulate import mode (delete all + insert), and save the template file.
Use the created template to import your file in command-line mode, and use Windows Scheduled Tasks to run it periodically.