Newbie to AWS.
I want to use the COPY command to import a table from DynamoDB to Redshift, but I get an error message such as "Invalid operation: Unsupported Data Type: Current Version only supports Strings and Numbers". Or the load only populates some columns, and the others (the more important ones, such as the sensor values in the payload) are null.
In DynamoDB, the hash key and range key are Strings, but the payload is in JSON format. How can I COPY this payload to Redshift?
The AWS documentation doesn't provide a detailed solution.
The COPY command can be used to copy data from a DynamoDB table whose attributes have scalar data types (i.e. STRING and NUMBER).
If any attributes in the DynamoDB table have other data types (e.g. Map, List, Set), the COPY command will fail; those types are not supported at the moment. As the AWS documentation puts it:
Only Amazon DynamoDB attributes with scalar STRING and NUMBER data types are supported. The Amazon DynamoDB BINARY and SET data types are not supported. If a COPY command tries to load an attribute with an unsupported data type, the command will fail. If the attribute does not match an Amazon Redshift table column, COPY does not attempt to load it, and it does not raise an error.
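A common workaround is to stage the data in S3 first: export the DynamoDB items as newline-delimited JSON with the Map attribute serialized to a plain string, then COPY from S3 into a table whose payload column is a VARCHAR. Below is a minimal Python sketch of that export; the table, bucket and attribute names are placeholders, and pagination of the scan is omitted for brevity.

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    # Hypothetical table with String keys and a Map attribute named "payload".
    table = dynamodb.Table("sensor_data")
    items = table.scan()["Items"]  # follow LastEvaluatedKey in real code

    lines = []
    for item in items:
        # Serialize the Map to a JSON string so every exported field is a
        # scalar that Redshift can load (payload lands in a VARCHAR column).
        item["payload"] = json.dumps(item.get("payload", {}), default=str)
        lines.append(json.dumps(item, default=str))  # default=str handles Decimal

    s3.put_object(
        Bucket="my-export-bucket",
        Key="dynamodb/sensor_data.json",
        Body="\n".join(lines).encode("utf-8"),
    )

    # Then, in Redshift (with the payload column declared as VARCHAR):
    # COPY sensor_data
    # FROM 's3://my-export-bucket/dynamodb/sensor_data.json'
    # IAM_ROLE 'arn:aws:iam::<account>:role/<redshift-role>'
    # FORMAT AS JSON 'auto';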
Related
I have CSV data stored at some location. I want to read this data and store it in a PCollection, but I am unable to work out what type the PCollection should be or how the data should be stored in it.
A quite straightforward way is to read it as plain strings (e.g. TextIO in the Java SDK will return a PCollection<String>) and then parse it with your own PTransform into the required format, returning a PCollection<YourPOJO> (see an example).
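The same pattern in the Beam Python SDK looks roughly like the sketch below; the file path and column layout are assumptions, not taken from the question.

    import csv
    import apache_beam as beam

    def parse_line(line):
        # Adjust the field names/types to match your CSV header.
        name, age = next(csv.reader([line]))
        return {"name": name, "age": int(age)}

    with beam.Pipeline() as pipeline:
        records = (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/data.csv",
                                             skip_header_lines=1)
            | "Parse" >> beam.Map(parse_line)  # yields a PCollection of dicts
        )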
When I use the BigQuery console manually, I can see that the 3 options when exporting a table to GCS are CSV, JSON (Newline delimited), and Avro.
With Airflow, when using the BigQueryToCloudStorageOperator operator, what is the correct value to pass to export_format in order to transfer the data to GCS as JSON (Newline delimited)? Is it simply JSON? All examples I've seen online for BigQueryToCloudStorageOperator use export_format='CSV', never for JSON, so I'm not sure what the correct value here is. Our use case needs JSON, since the 2nd task in our DAG (after transferring data to GCS) is to then load that data from GCS into our MongoDB Cluster with mongoimport.
I found that the value export_format='NEWLINE_DELIMITED_JSON' was required, after finding the documentation at https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationextract and referring to the values for destinationFormat.
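As an illustration, a task using that value might look like the sketch below. The project/dataset/table and bucket names are placeholders, and the import path is the Airflow 1.x contrib one; newer Airflow versions expose the same operator as BigQueryToGCSOperator in the Google provider package.

    from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

    export_to_gcs = BigQueryToCloudStorageOperator(
        task_id="bq_table_to_gcs_json",
        source_project_dataset_table="my_project.my_dataset.my_table",
        destination_cloud_storage_uris=["gs://my-bucket/exports/my_table-*.json"],
        export_format="NEWLINE_DELIMITED_JSON",  # newline-delimited JSON for mongoimport
        dag=dag,  # assumes a DAG object defined elsewhere
    )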
According to the BigQuery documentation, the three possible formats to which you can export BigQuery query results are CSV, JSON, and Avro (and this is consistent with the UI drop-down menu).
I would try with export_format='JSON' as you already proposed.
I am trying to load a JSON file into BigQuery using the bq load command:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON project_abd:ds.online_data gs://online_data/file.json
One of the key:value pairs in the JSON file looks like this:
"taxIdentifier":"T"
The bq load fails with the message: Error while reading data, error message: JSON parsing error in row starting at position 713452: Could not convert value to boolean. Field: taxIdentifier; Value: T (the JSON is really huge, hence I can't paste it here).
I am really confused as to why autodetect is treating the value T as a boolean. I have tried all combinations of creating the table with a STRING data type and then loading it, but with autodetect it errors out with "changed type from STRING to BOOLEAN"; if I do not use autodetect, the load succeeds.
I have to use the "autodetect" feature, since the JSON is a result of an API call and the columns may increase or decrease.
Any idea why the value T behaves this way, and how to get around it?
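For reference, the explicit-schema load (the variant the question says succeeds without autodetect) looks roughly like this with the google-cloud-bigquery Python client; the schema here is truncated to the one problematic field and the remaining fields from the API response would have to be listed as well.

    from google.cloud import bigquery

    client = bigquery.Client(project="project_abd")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=False,
        schema=[
            bigquery.SchemaField("taxIdentifier", "STRING"),
            # ... the remaining fields from the API response ...
        ],
    )

    load_job = client.load_table_from_uri(
        "gs://online_data/file.json",
        "project_abd.ds.online_data",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish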
Does the JSON fields array in the view.drill file created by CREATE VIEW support describing ARRAY or STRUCT types?
I want to define views over Parquet files so that they are described by the JDBC driver (DatabaseMetaData.getTables, getColumns, ...), and to project the columns of the Parquet file as their actual types (e.g. INTEGER) rather than leaving Drill to describe them as the ANY type. Unfortunately, that requires CAST(<column> AS <type>), and the target types supported by CAST do not include ARRAY or STRUCT.
Hence, if the view.drill file could be defined externally without requiring a CAST, and it supported complex types, I would build the views programmatically.
I'm running into a problem importing multiple small CSV files with over 250,000 columns of float64 into Apache Spark 2.0 running on a Google Dataproc cluster. There are a handful of string columns, but I'm only really interested in one of them as the class label.
When I run the following in pyspark
csvdata = spark.read.csv("gs://[bucket]/csv/*.csv", header=True,mode="DROPMALFORMED")
I get a
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o53.csv.
: com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480
Hint: Number of columns processed may have exceeded limit of 20480 columns. Use settings.setMaxColumns(int) to define the maximum
number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
Where/how do I set the maximum number of columns for the parser so I can use the data for machine learning?
Is there a better way to ingest the data for use with Spark MLlib?
This question points to defining a class for the dataframe to use, but would it be possible to define such a large class without having to create 210,000 entries?
Use the option:
spark.read.option("maxColumns", n).csv(...)
where n is the maximum number of columns your input can have (so at least the actual column count).
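Applied to the read from the question, that might look like the snippet below; the threshold of 300000 is just an example and has to be at least as large as the widest file you expect.

    csvdata = (
        spark.read
             .option("maxColumns", 300000)  # raise the univocity parser limit
             .csv("gs://[bucket]/csv/*.csv", header=True, mode="DROPMALFORMED")
    )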