Retrieve CSV format over HTTPS

I'm retrieving data in CSV format from a web service via an HTTP request.
The data can contain a web address with parameters in JSON format ( https://example.com?parm=abc&opt={"key1":"value1","key2":"value2"} ).
The commas within the JSON string cause the subsequent data processing with Spark to mangle the rows.
The web service providing the data does not allow changing the CSV delimiter.
In the resulting CSV file the embedded double quotes are doubled, like
' "https....""key1"":""value1""... '
Are there any options in the HTTP protocol to 'correctly' transport/quote the data, or is this rather a Spark issue?
I am using Postman to inspect the 'look and feel' of the delivered data.

I found the solution on the Spark side:
spark.read.load(file, format='csv', header=True, quote='"', escape='"')
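For context, here is a slightly fuller PySpark sketch of the same fix (the file path is a placeholder): setting both quote and escape to the double quote tells Spark that "" inside a quoted field is a literal quote, so the commas inside the JSON value no longer split the column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-embedded-json").getOrCreate()

# Placeholder path to the CSV downloaded from the web service.
df = spark.read.load(
    "downloaded_data.csv",
    format="csv",
    header=True,
    quote='"',   # fields are wrapped in double quotes
    escape='"',  # "" inside a field is an escaped literal quote (RFC 4180 style)
)

df.show(truncate=False)  # the URL with opt={"key1":"value1",...} stays in one column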

Related

How can I escape characters when creating a CSV file in Data Fusion?

I am creating a pipeline in Google Data Fusion that should read records from a source database and write them to a target CSV file in Cloud Storage.
The problem is that the separator character in the resulting file is a comma ",", and some fields are strings containing phrases with commas, so when I try to load the resulting file in Wrangler as CSV I get an error: the number of fields in the CSV does not match the number of fields in the schema (because of the fields containing commas).
How can I escape these special characters in the pipeline?
Thanks and regards.
Try writing the data as TSV instead of CSV (set the format of the sink plugin to tsv). Then load the data as tsv in Wrangler.
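Outside of Data Fusion, a quick Python sketch of why the tab delimiter sidesteps the problem (the rows are made up): fields that contain commas round-trip through TSV without any quoting or escaping.
import csv
import io

rows = [["1", "Smith, John", "42"], ["2", "Sue", "14"]]

buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)  # write as TSV

parsed = list(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
assert parsed == rows  # the comma inside "Smith, John" never splits the field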

Store json string in txt file or database

Suppose we are getting a lot of list data in JSON format, and for every single day the API returns the same data.
Now, if I apply a filter on the JSON, where can I store the API JSON result for the current day so that there is no need to call the API multiple times?
Should I store it in a txt file, in a database, or maybe in a cache?
It depends on your aims. You may use a text file or a DB field.
You may use Redis as a cache.
Try starting with a text file first. Probably it will help you.
1) Draft usage of a text (.json) file:
// $json = json_encode($array); // if you don't have json data
$filePath = sprintf('%s_cache.json', date('Y-m-d'));
file_put_contents($filePath, $json);
2) Usage of JSON in MySQL (t1 is a placeholder table with a JSON column named payload):
INSERT INTO t1 (payload) VALUES (JSON_OBJECT('key', 'value')); -- build the JSON in SQL
INSERT INTO t1 (payload) VALUES ('{"key": "value"}');          -- or insert a JSON string directly
More details about MySQL are here: https://dev.mysql.com/doc/refman/5.7/en/json.html
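The draft above only writes the cache; for the "call the API only once per day" part, a minimal Python sketch of the same idea (URL and file name are placeholders) could look like this:
import json
import os
from datetime import date
from urllib.request import urlopen

cache_path = f"{date.today().isoformat()}_cache.json"  # one cache file per day

if os.path.exists(cache_path):
    with open(cache_path) as f:
        data = json.load(f)          # today's result is already cached
else:
    with urlopen("https://api.example.com/list") as resp:  # placeholder URL
        data = json.load(resp)
    with open(cache_path, "w") as f:
        json.dump(data, f)           # store it so later calls skip the API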

With an existing JSON file, how do I upload this data to BigQuery and calculate a new field with data in the JSON file?

I have a newline-delimited JSON file that I'm going to upload to BigQuery.
Each row of the JSON file contains many fields, and I would like to add two of them together to form a new column containing the sum.
However, there are millions of records, and I would rather not use SQL to do this after the JSON has been fully uploaded.
Is there any process that can accomplish what I'm looking for?
Maybe something in the JSON schema? Or maybe something in the way I upload the JSON and the JSON schema to BigQuery?
Many thanks! :)
Check my "lazy data loading in BigQuery" post:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
The trick is to set up BigQuery to read the GCS files as federated CSV files with a rare character as the separator. Then you can do any transformation within BigQuery itself.
In my case:
#standardSQL
CREATE VIEW `fh-bigquery.views.wikipedia_views_test_ddl`
AS SELECT
PARSE_TIMESTAMP('%Y%m%d-%H%M%S', REGEXP_EXTRACT(_FILE_NAME, '[0-9]+-[0-9]+')) datehour
, REGEXP_EXTRACT(line, '([^ ]*) ') wiki
, REGEXP_EXTRACT(line, '[^ ]* (.*) [0-9]+ [0-9]+') title
, CAST(REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') AS INT64) views
, CAST(REGEXP_EXTRACT(line, ' ([0-9]+)$') AS INT64) zero
, _FILE_NAME filename
, line
FROM `fh-bigquery.views.wikipedia_views_gcs`
WHERE REGEXP_EXTRACT(line, ' ([0-9]+) [0-9]+$') IS NOT NULL # views
AND REGEXP_EXTRACT(line, ' ([0-9]+)$') = '0' # zero
Instead of REGEXP_EXTRACT you could do JSON_EXTRACT/JSON_EXTRACT_SCALAR, or - for maximum flexibility - JavaScript UDFs.
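A rough sketch of how the same trick could be applied to the asker's newline-delimited JSON using the google-cloud-bigquery Python client; project, dataset, bucket and field names are placeholders, not anything from the original post.
from google.cloud import bigquery

client = bigquery.Client()

# Federated "CSV" table over the GCS files, with a delimiter chosen so that
# every newline-delimited JSON record lands in a single STRING column.
ext = bigquery.ExternalConfig("CSV")
ext.source_uris = ["gs://my-bucket/exports/*.json"]    # placeholder bucket/path
ext.schema = [bigquery.SchemaField("line", "STRING")]
ext.options.field_delimiter = "\u00fe"                 # a character absent from the data
ext.options.quote_character = ""                       # disable CSV quoting

raw = bigquery.Table("my-project.my_dataset.raw_json_lines")  # placeholder names
raw.external_data_configuration = ext
client.create_table(raw, exists_ok=True)

# View that parses two numeric JSON fields and exposes their sum as a new column.
view = bigquery.Table("my-project.my_dataset.json_with_sum")
view.view_query = """
SELECT
  CAST(JSON_EXTRACT_SCALAR(line, '$.field_a') AS INT64) AS field_a,
  CAST(JSON_EXTRACT_SCALAR(line, '$.field_b') AS INT64) AS field_b,
  CAST(JSON_EXTRACT_SCALAR(line, '$.field_a') AS INT64)
    + CAST(JSON_EXTRACT_SCALAR(line, '$.field_b') AS INT64) AS field_sum
FROM `my-project.my_dataset.raw_json_lines`
"""
client.create_table(view, exists_ok=True)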

U-SQL: Schematized input files

How can I use schematized input files in a U-SQL script? That is, how can I use multiple files as input to an EXTRACT clause?
According to
https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx?f=255&MSPPError=-2147217396
and
https://social.msdn.microsoft.com/Forums/en-US/0ad563d8-677c-46e7-bb3e-e1627025f2e9/read-data-from-multiple-files-and-folder-using-usql?forum=AzureDataLake&prof=required
I tried both
@rs =
EXTRACT s_type string, s_filename string
FROM "/Samples/logs/{s_filename:*}.txt"
USING Extractors.Tsv();
and
@rs =
EXTRACT s_type string
FROM "/Samples/logs/{*}.txt"
USING Extractors.Tsv();
Both versions result in an error message complaining that '*' is an invalid character.
File sets are not supported locally so far. They will work when you run the script on a cloud Azure Data Lake Analytics account.

Importing CSV file in Talend - how to set options to match Excel

I have a CSV file that I can open in Excel 2012 and it comes in perfectly. When I try to set up the metadata for this CSV file in Talend, the fields (columns) are not split the same way as Excel splits them. I suspect I am not setting the metadata properly.
The specific issue is that I have a column with string data in it which may contain commas within the string. For example, suppose I have a CSV file with three columns: ID, Name and Age, which looks like this:
ID,Name,Age
1,Ralph,34
2,Sue,14
3,"Smith, John", 42
When Excel reads this CSV file, it treats the second element of the third row ("Smith, John") as a single token and places it into a cell by itself.
Talend tries to break this same token into two, since there is a comma within the token. Apparently Excel ignores all delimiters within a quoted string, while Talend by default does not.
My question is: how do I get Talend to behave the same as Excel?
If you use the tFileInputDelimited component to read this CSV file, you can use "," as the delimiter and, under the CSV options of this component, enable the Text Enclosure option with """. Even if you use metadata, there is an option to define the string/text enclosure; set it to """ to resolve your problem.
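The same principle in a quick Python sketch (not Talend itself), using rows like the ones in the question: declaring the double quote as the text enclosure keeps "Smith, John" in one field.
import csv
import io

raw = 'ID,Name,Age\n1,Ralph,34\n2,Sue,14\n3,"Smith, John",42\n'

# quotechar='"' plays the role of Talend's Text Enclosure setting:
# delimiters inside a quoted field are ignored.
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(rows[3])  # ['3', 'Smith, John', '42'] - three fields, not four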