How to upload CSV data that contains newlines with dbt

I have a 3rd-party-generated CSV file that I wish to upload to Google BigQuery using dbt seed.
I can upload it manually to BigQuery, but only if I enable "Quoted newlines", which is off by default.
When I run dbt seed, I get the following error:
16:34:43 Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43 Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.
There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
Any ideas?

As far as I know, the seed feature is quite limited by what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for expanded seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.
That said, what has worked very well for me is to store flat files within the GCP project in a GCS bucket and then use the dbt-external-tables package for very similar but much more robust file structuring. Managing this can be a lot of overhead, I know, but it becomes very much worth it if your seed files keep growing in a way that can take advantage of partitioning, for instance.
And more importantly, as mentioned in this answer from Jeremy on Stack Overflow:
The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.
Which, for your case, should be the quote and allowQuotedNewlines options. If you do choose to use dbt-external-tables, your source .yml for this would look something like:
gcs.yml
version: 2

sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage

    tables:
      - name: task
        description: "External table of ClickUp tasks, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true
Or something very similar.
And if you end up taking this path and storing task data in a daily partition like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, which Jeremy also shared with me here:
Retrieve "filename" from gcp storage during dbt-external-tables sideload?
I find it to be a very powerful set of tools for files specifically with BigQuery.
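And if you just need this one file loaded while you evaluate that approach, the same "Quoted newlines" switch can also be flipped outside of dbt through the BigQuery Python client. A minimal sketch, assuming the google-cloud-bigquery package and a placeholder project/dataset/table:

from google.cloud import bigquery

client = bigquery.Client()

# allow_quoted_newlines is the "Quoted newlines" switch that dbt seed cannot pass through.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_quoted_newlines=True,
    autodetect=True,
)

# "my-project.my_dataset.clickup_task" is a placeholder destination table.
with open("data/clickup_task.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "my-project.my_dataset.clickup_task", job_config=job_config
    )
job.result()  # wait for the load to finish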

ADF Merge-Copying JSON files in Copy Data Activity creates error for Mapping Data Flow

I am trying to do some optimization in ADF. The setup: a third-party tool copies one JSON file per object to a blob storage container, and these feed a Mapping Data Flow. The individual files written by the third-party tool work great. If I copy those files to a different blob folder using an Azure Copy Data activity, the MDF can no longer parse the files and gives the error: "JSON parsing error, unsupported encoding or multiline." I started this with a Merge Files copy, but the outcome is the same regardless of which copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy Activity Merge File from JSON to JSON definitely adds an EOL character to each single JSON object as it gets imported into the merge file. I have also found that the MDF definitely fails with those EOL characters in the merge file. If I remove all EOL characters from the merge file, the same MDF works. For me, this is a bug: the copy activity is adding a character that breaks the MDF. There also seems to be a second issue with some of my data that doesn't fail as an individual file but does break the MDF when the files are concatenated; however, I have tested the basic behavior on 1-5000 files and can repeat the fail/success results.
I took the original file and the copied file and ran them through all sorts of tests. Here is what I eventually found when I dumped them into Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF change masked something else. I'm not sure what to do at this point, but it's super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work -- data preview returns same success or failure as pipeline when run
In the case of multiple documents merged by ADF, I get the parsing failure described above instead.
Repro: Create any valid JSON as a single file, put it in blob storage, use it as a source in a mapping data flow, to do any sink operation. Create a second file with same schema, get them both to run in same flow using wildcard paths. Use a Copy Activity with Merge Files as the Sink Copy Activity and Array of Objects as the File pattern. Try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from standard VS Code JSON extension, and VS 2019 "Unminify" command) and reupload... It should work now.
I don't know if you already solved the problem, but I came across the exact same issue three days ago and after several tries I found a solution:
In the copy data activity, under sink settings, use "Set of objects" (instead of "Array of objects") as the File pattern, so that the merged JSON has each of the original small JSON documents written on its own line.
In the MDF, after setting up the wildcard paths with the *.json pattern, select "Document per line" as the Document form under JSON settings.
After that you should be good to go; at least it solved my problem. The CRLF that the "Array of objects" setting in the copy data activity writes automatically should be optional, and MSFT should provide a setting to omit it in the future.
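If you already have merged files that were written with the "Array of objects" pattern, a small script can rewrite them into the document-per-line form the Data Flow expects. A rough sketch, assuming the merged file is a top-level JSON array (the file names are placeholders):

import json

# Read the "array of objects" merge file and rewrite it as one JSON document per line,
# forcing "\n" endings so no stray "\r" reaches the Mapping Data Flow.
with open("merged_array.json", "r", encoding="utf-8") as src:
    records = json.load(src)

with open("merged_lines.json", "w", encoding="utf-8", newline="\n") as dst:
    for record in records:
        dst.write(json.dumps(record, separators=(",", ":")) + "\n")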
According to my test:
1. The copy data activity can't change Unix (LF) line endings to Windows (CRLF).
2. MDF can parse both Unix (LF) and Windows (CRLF) files.
Maybe there is something else wrong.
By the way, I see there is a trailing comma after "name":"Customer Name" in your original file; I deleted it before my test.

Using Apache NiFi to collect files from a 3rd-party REST API - flow advice

I am trying to create a flow within Apache NiFi to collect files from a 3rd-party RESTful API, and I have set up my flow with the following:
InvokeHTTP - ExtractText - PutFile
I can collect the file that I am after, as I have specified it within my Remote URL; however, when I get all of the data from that file, it outputs multiple (hundreds of) copies of the same file to my output directory.
3 things I need help with:
1: How do I get the flow to output the file as a readable .csv rather than just a file with no extension?
2: How can I stop the processor once I have all of the data that I need?
3: The JSON file that I have been supplied with gives me the option to get files from a certain date range:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/1553731200000
Or I can choose a specific file:
https://api.3rdParty.com/reports/v1/scheduledReports/download/877800/201904/CTDDaily/2019-04-02T01:50:00Z.csv
But how can I get NiFi to automatically check for newer files, as this process will be running daily and we will be downloading a new file each day?
If this is too broad, please help me by letting me know so I can edit this post.
Thanks.
Note: 3rdParty host name has been renamed to comply with security - therefore links will not directly work. Thanks.
1) You can change the filename of the flow file to anything you want using the UpdateAttribute processor. If you want it to have a ".csv" extension then you can add a property named "filename" with a value of "${filename}.csv" (without the quotes when you enter it).
2) By default most processors have a timer-driven scheduling strategy with a run schedule of 0 seconds, which means keep running as fast as possible. Go to the Scheduling tab in the processor's configuration and set an appropriate schedule; it sounds like you probably want CRON scheduling so you can run it daily.
3) You can use NiFi expression language statements to create dynamic time ranges. I don't fully understand the syntax for the API that you have to communicate with, but you could do something like this for the URL:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/${now():toNumber()}
Where now():toNumber() returns the current timestamp as epoch milliseconds, which matches the numeric suffix in your example URL.
You can also format it to a date string if necessary:
${now():format('yyyy-MM-dd')}
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Managing a large SPSS (*.sav) file (4.2 GB)

I have received an SPSS file from a survey fielded by another company that allegedly contains only ~1500 respondents, but the file size has somehow ballooned to 4.2 GB. My hunch is that the file comes from a global survey and the 1500 selected records are US only, so the file carries a series of blank variables, plus metadata for those variables, possibly in multiple languages/alphabets.
I only need a subset of this data, and can likely work with it if I removed the metadata but my issue has been that I can't get the damn thing open to cut down on the number of variables. I have been using the tools at my disposal to try the following workarounds, though I'm sure there are better options:
Opening the file using PSPP (freeware SPSS) - this causes PSPP to stop responding
Using the R command read.spss (from the foreign package) to write a .csv - this claims that the file has a duplicate variable name and won't proceed further
Using the R command spss.system.file to write a .csv - when I tried this, R spent a long time thinking as it attempted to run, and it had been going for a couple of hours with no apparent success.
Using the PSPP text conversion tool (https://pspp.benpfaff.org/) to create either a dictionary or a .csv file - both of these options crash after the file has completed uploading.
I've gone back to the other company to try to have them reduce the file size; however, I wasn't sure if anyone else had any ideas for doing either of the following:
Open the file using another program/converter that could turn it into a .csv or other similarly skinny file format
Use another program to at least read only the variable names included in the file so that I can provide the other company with the specific variables I need
The following command from PSPP should do what you need:
$ pspp-convert originalFile.sav output.csv
In case it doesn't, please provide the terminal error message.
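If pspp-convert also chokes on a file this size, another route that may cover both of your bullets is pyreadstat, which can read just the metadata (variable names and labels) without loading any case data. A hedged sketch, assuming the pyreadstat package and placeholder file/variable names:

import pyreadstat

# metadataonly=True skips the case data entirely, so this stays cheap even at 4.2 GB.
_, meta = pyreadstat.read_sav("survey.sav", metadataonly=True)

print(len(meta.column_names), "variables")
for name, label in zip(meta.column_names, meta.column_labels):
    print(name, "|", label)

# Once you know which variables you actually need, pull only those columns:
df, _ = pyreadstat.read_sav("survey.sav", usecols=["RESPID", "Q1", "Q2"])
df.to_csv("survey_subset.csv", index=False)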

Ingesting JSON log files into Elasticsearch with Filebeat

I have a custom log file in JSON format; the app we are using outputs one entry per file, as follows:
{"cuid":1,"Machine":"001","cuSize":0,"starttime":"2017-03-19T15:06:48.3402437+00:00","endtime":"2017-03-19T15:07:13.3402437+00:00","rejectcount":47,"fitcount":895,"unfitcount":58,"totalcount":1000,"processedcount":953}
I am trying to ingest this into Elasticsearch. I believe this is possible, as I am using ES 5.x.
I have configured my Filebeat prospector and have attempted to at least pull out one field from the file for now, namely the cuid.
filebeat.prospectors:
  - input_type: log
    json.keys_under_root: true
    paths:
      - C:\Files\output\*-Account-*
    tags: ["json"]

output.elasticsearch:
  # The Logstash hosts
  hosts: ["10.1.0.4:9200"]
  template.name: "filebeat"
  template.path: "filebeat.template.json"
  template.overwrite: true

processors:
  - decode_json_fields:
      fields: ["cuid"]
When I start Filebeat, it seems to harvest the files, as I get entries in the Filebeat registry file:
2017-03-20T13:21:08Z INFO Harvester started for file:
C:\Files\output\001-Account-20032017105923.json
2017-03-20T13:21:27Z INFO Non-zero metrics in the last 30s: filebeat.harvester.closed=160 publish.events=320 filebeat.harvester.started=160 registrar.states.update=320 registrar.writes=2
However, I can't seem to find the data within Kibana, and I am not entirely sure how to find it.
I have ensured the Filebeat templates are loaded in Kibana.
I have tried to read the documentation, and I think I understand it correctly, but I am still very hazy as I am totally new to the stack.
I am still not entirely sure this is the right answer; however, I managed to resolve my particular issue. We were writing out multiple JSON files to the directory, each with just a single line in it, as detailed above. Although Filebeat appeared to harvest the files, I don't think it was reading them.
I modified the application to use log4net with a RollingFileAppender, then ran the application, which started emitting logs to the directory, and as if by magic, without modifying my filebeat.yml, it all just started working.
I can only conclude that Filebeat does not handle lots of single-line JSON files well, unless there is some other configuration I am unaware of.
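If Filebeat keeps fighting you on these one-line-per-file documents, another option is to bypass it for this feed and bulk-index the files straight into Elasticsearch with the Python client. A rough sketch, assuming the elasticsearch package; the index name is a placeholder, and on ES 5.x you would also need a "_type" in each action:

import glob
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://10.1.0.4:9200"])

def account_docs():
    # Each file holds exactly one JSON object on a single line.
    for path in glob.glob(r"C:\Files\output\*-Account-*"):
        with open(path, "r", encoding="utf-8") as f:
            yield {
                "_index": "account-stats",  # placeholder index name
                "_source": json.loads(f.read()),
            }

helpers.bulk(es, account_docs())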

Migrating from Lighthouse to Jira - Problems Importing Data

I am trying to find the best way to import all of our Lighthouse data (which I exported as JSON) into JIRA, which wants a CSV file.
I have a main folder containing many subdirectories, JSON files and attachments; the total size is around 50 MB. JIRA allows importing CSV data, so I was thinking of converting the JSON data to CSV, but all the converters I have seen online will only handle a single file rather than recursing through an entire folder structure and producing the CSV equivalent, which could then be imported into JIRA.
Does anybody have any experience of doing this, or any recommendations?
Thanks, Jon
The JIRA CSV importer assumes a denormalized view of each issue, with all the fields available in one line per issue. I think the quickest way would be to write a small Python script to read the JSON and emit the minimum CSV. That should get you issues and comments. Keep track of which Lighthouse ID corresponds to each new issue key. Then write another script to add things like attachments using the JIRA SOAP API. For JIRA 5.0 the REST API is a better choice.
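A rough sketch of that first script, assuming each exported ticket.json holds one ticket; the field names below are guesses at the Lighthouse export format, so check them against your files:

import csv
import glob
import json

# Flatten each Lighthouse ticket.json into one row of the CSV the JIRA importer expects.
with open("jira_import.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Summary", "Description", "Status", "Lighthouse ID"])
    for path in glob.glob("lighthouse_export/tickets/*/ticket.json"):
        with open(path, encoding="utf-8") as f:
            ticket = json.load(f)
        writer.writerow([
            ticket.get("title", ""),
            ticket.get("body", ""),    # guessed field name
            ticket.get("state", ""),
            ticket.get("number", ""),  # keep this to map back to the new JIRA issue key
        ])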
We just went through a Lighthouse to JIRA migration and ran into this. The best thing to do is, in your script, to start at the top-level export directory and loop through each ticket.json file. You can then build a master CSV or JSON file containing all the tickets to import into JIRA.
In Ruby (which is what we used), it would look something like this:
require "json"

Dir.glob("path/to/lighthouse_export/tickets/*/ticket.json") do |ticket|
  JSON.parse(File.open(ticket).read).each do |data|
    # access ticket data and add it to a CSV
  end
end