I have a custom log file in JSON format; the app we are using outputs one entry per file, as follows:
{"cuid":1,"Machine":"001","cuSize":0,"starttime":"2017-03-19T15:06:48.3402437+00:00","endtime":"2017-03-19T15:07:13.3402437+00:00","rejectcount":47,"fitcount":895,"unfitcount":58,"totalcount":1000,"processedcount":953}
I am trying to ingest this into Elasticsearch. I believe this is possible, as I am using ES 5.x.
I have configured my Filebeat prospector and, for now, have attempted to pull out at least one field from the file, namely the cuid:
filebeat.prospectors:
- input_type: log
  json.keys_under_root: true
  paths:
    - C:\Files\output*-Account-*
  tags: ["json"]

output.elasticsearch:
  # The Logstash hosts
  hosts: ["10.1.0.4:9200"]
  template.name: "filebeat"
  template.path: "filebeat.template.json"
  template.overwrite: true

processors:
- decode_json_fields:
    fields: ["cuid"]
When I start Filebeat it seems to harvest the files, as I get an entry in the Filebeat registry file and log output like this:
2017-03-20T13:21:08Z INFO Harvester started for file:
C:\Files\output\001-Account-20032017105923.json
2017-03-20T13:21:27Z INFO Non-zero metrics in the last 30s: filebeat.harvester.closed=160 publish.events=320 filebeat.harvester.started=160 registrar.states.update=320 registrar.writes=2
However, I can't seem to find the data within Kibana, and I am not entirely sure how to look for it.
I have ensured the Filebeat templates are loaded in Kibana.
I have tried to read the documentation and think I understand it, but I am still very hazy, as I am totally new to the stack.
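For reference, a direct query against Elasticsearch (assuming Filebeat's default filebeat-* index naming, since no custom index is configured above) would at least show whether any events were indexed:
curl -XGET 'http://10.1.0.4:9200/_cat/indices?v'
curl -XGET 'http://10.1.0.4:9200/filebeat-*/_search?size=1&pretty'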
I am still not entirely sure this is the right answer, but I managed to resolve my particular issue. We were writing out multiple JSON files to the directory, each with just a single line in it, as detailed above. Although Filebeat appeared to harvest the files, I don't think it was actually reading them.
I modified the application to use log4net with a RollingFileAppender and ran it again; it started emitting logs to the directory and, as if by magic, without modifying my filebeat.yml, everything just started working.
I can only conclude that Filebeat does not handle many single-line JSON files, unless there is some other configuration I am unaware of.
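(Hedged note: one JSON option documented for Filebeat 5.x is json.add_error_key, which attaches an error key to any event whose JSON could not be decoded, so decoding problems at least become visible in the indexed data. A minimal prospector sketch reusing the paths above:)
filebeat.prospectors:
- input_type: log
  paths:
    - C:\Files\output*-Account-*
  json.keys_under_root: true
  json.add_error_key: true
  tags: ["json"]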
I have a 3rd party generated CSV file that I wish to upload to Google BigQuery using dbt seed.
I manage to upload it manually to BigQuery, but I need to enable "Quoted newlines" which is off by default.
When I run dbt seed, I get the following error:
16:34:43 Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43 Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.
There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
Any ideas?
As far as I know, the seed feature is very limited by what is built into dbt-core, so seeds are not the route I would take here. You can see the history of requests for expanding seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.
That said, what has worked very well for me is to store flat files in a GCS bucket within the GCP project and then use the dbt-external-tables package for very similar but much more robust file handling. Managing this adds some overhead, I know, but it becomes well worth it if your seed files keep growing in a way that can take advantage of partitioning, for instance.
More importantly, as mentioned in this answer from Jeremy on Stack Overflow:
The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.
For your case, that should be either the quote or allowQuotedNewlines option. If you did choose to use dbt-external-tables, your source.yml for this would look something like:
gcs.yml
version: 2
sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
    tables:
      - name: task
        description: "External table of Snowplow events, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true
Or something very similar.
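If you do go this route, the external table itself gets created or refreshed by the package's staging operation; as far as I recall from the dbt-external-tables docs, that is run with:
dbt run-operation stage_external_sources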
And if you end up taking this path and storing task data in daily partitions like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, which Jeremy also shared with me here:
Retrieve "filename" from gcp storage during dbt-external-tables sideload?
I find it to be a very powerful set of tools for working with files in BigQuery specifically.
I am trying to do some optimization in ADF. The setup: a third-party tool copies one JSON file per object to a blob storage container, and these files feed a Mapping Data Flow. The individual files written by the third-party tool work great. But if I copy these files to a different blob folder using a Copy Data activity, the MDF can no longer parse the files and gives the error: "JSON parsing error, unsupported encoding or multiline." I started with the Merge Files copy behavior, but the outcome is the same regardless of which copy behavior I choose.
2ND EDIT: After another day's work, I have found that a Copy activity merging JSON files into a single JSON file definitely adds an EOL character to each single JSON object as it is written to the merged file, and that the MDF definitely fails when those EOL characters are present. If I remove all EOL characters from the merged file, the same MDF works. For me, this is a bug: the Copy activity is adding a character that breaks the MDF. There also seems to be a second issue in some of my data that doesn't fail as an individual file but does when concatenated, but I have tested the basic behavior on 1-5000 files and can repeat the fail/success results.
I took the original file and the copied file and ran them through all sorts of tests. Here is what I eventually found when I dumped them into Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- It seems on further review like the minification/whitespace removal may be the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF difference masked something else. I'm not sure what to do at this point, but it's super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work; data preview returns the same success or failure as the pipeline when run.
In the case of multiple documents merged by ADF, I get the parsing error instead, as described in the edit above.
Repro: create any valid JSON as a single file, put it in blob storage, and use it as a source in a mapping data flow feeding any sink. Create a second file with the same schema and get them both to run in the same flow using wildcard paths. Then use a Copy activity with Merge Files as the sink copy behavior and Array of Objects as the file pattern, and try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code's "Format Document" from the standard JSON extension and the VS 2019 "Unminify" command) and re-upload it; it should work now.
I don't know if you have already solved the problem, but I came across the exact same issue three days ago and after several tries found a solution:
In the Copy Data activity, under sink settings, use "Set of objects" (instead of "Array of objects") as the File pattern, so that the merged JSON has each of the original small JSON documents written on its own line.
In the MDF, after setting up the wildcard paths with the *.json pattern, select "Document per line" as the Document form under JSON settings (see the script sketch after this answer).
After that you should be good to go; at least it solved my problem. The automatically written CRLF in the "Array of objects" setting of the Copy Data activity is currently the default, and MSFT should provide an option to omit it in the settings in the future.
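For reference, and hedged since this is from memory of the data flow script syntax rather than from this exact pipeline: the same "Document per line" choice shows up in the script behind the source transformation roughly as below, where the stream name and wildcard path are placeholders.
source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['container/folder/*.json'],
    documentForm: 'documentPerLine') ~> JsonSource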
According to my test:
1. The Copy Data activity can't change Unix (LF) line endings to Windows (CRLF).
2. The MDF can parse both Unix (LF) and Windows (CRLF) files.
Maybe there is something else wrong.
By the way, I see there is a trailing comma after "name":"Customer Name" in your original file; I deleted it before my test.
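For reference, the element parses cleanly once that trailing comma is removed:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name"}]}}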
I see several posts here and in a Google search for org.apache.hadoop.mapred.InvalidInputException
but most deal with HDFS files or trapping errors. My issue is that while I can read a CSV file from spark-shell, running it from a compiled JAR constantly returns an org.apache.hadoop.mapred.InvalidInputException error.
The rough process of the jar:
1. read from JSON documents in S3 (this works)
2. read from parquet files in S3 (this also succeeds)
3. write a result of a query against #1 and #2 to a parquet file in S3 (also succeeds)
4. read a configuration csv file from the same bucket #3 is written to. (this fails)
These are the various approaches that I have tried in code:
1. val osRDD = spark.read.option("header","true").csv("s3://bucket/path/")
2. val osRDD = spark.read.format("com.databricks.spark.csv").option("header", "true").load("s3://bucket/path/")
All variations of the two above with s3, s3a and s3n prefixes work fine from the REPL but inside a JAR they return this:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/path/eventsByOS.csv
So, it found the file but can't read it.
Thinking this was a permissions issue, I have tried:
a. export AWS_ACCESS_KEY_ID=<access key> and export AWS_SECRET_ACCESS_KEY=<secret> from the Linux prompt. With Spark 2 this has been sufficient to provide us access to the S3 folders up until now.
b. .config("fs.s3.access.key", <access>)
.config("fs.s3.secret.key", <secret>)
.config("fs.s3n.access.key", <access>)
.config("fs.s3n.secret.key", <secret>)
.config("fs.s3a.access.key", <access>)
.config("fs.s3a.secret.key", <secret>)
Before this failure, the code reads from parquet files located in the same bucket and writes parquet files to the same bucket. The CSV file is only 4.8 KB in size.
Any ideas why this is failing?
Thanks!
Adding stack trace:
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:253)
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1332)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.take(RDD.scala:1326)
org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1367)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.first(RDD.scala:1366)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:206)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
scala.Option.orElse(Option.scala:289)
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
nothing springs out when I paste that stack into the IDE, but I'm looking at a later version of Hadoop and can't currently switch to older ones.
Have a look at these instructions
That landsat gz file is actually a CSV file you can try to read in; it's the one we generally use for testing because it's there and free to use. Start by seeing if you can work with it.
If using Spark 2.0, use Spark's own CSV support.
Do use S3a, not the others.
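A minimal sketch of that suggestion, assuming the s3a connector and its dependencies are on the classpath and the credentials are exported in the environment; the bucket/key is the public Landsat test file mentioned above, everything else is illustrative:
// Scala / Spark 2.x: configure s3a explicitly, then read the CSV with
// Spark's built-in reader (the .gz is decompressed transparently).
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val scenes = spark.read.option("header", "true").csv("s3a://landsat-pds/scene_list.gz")
scenes.show(5)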
I solved this problem by adding the Hadoop configuration specific to the scheme being used (s3 in the example here). The odd thing is that the credentials approach above worked for everything in Spark 2.0 EXCEPT reading the CSV.
This code solved my problem when using s3:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", p.aws_accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey",p.aws_secretKey)
I have been trying to find out what db.json is and why it is being automatically generated. All the documentation on hexo.io says is:
$ hexo clean
Cleans the cache file (db.json) and generated files (public).
What is this exactly? Since these are all static pages, is this some sort of makeshift database?
Most commonly, db.json is used when you're running a server with hexo server. I believe it's there for performance improvements. It doesn't affect generation (hexo generate) or deployment (hexo deploy).
The db.json file stores all the data needed to generate your site: all the posts, tags, categories, etc. The data is stored as a JSON-formatted string so it's easier and faster to parse and generate the site.
I'm new to Camel, and the lack of similar questions online leads me to believe I'm doing something silly. I am using Camel 2.12.1 components and am parsing large CSV files both from local directories and by downloading them over SFTP. I've found that
split(body().tokenize("\n")).streaming().unmarshal().csv()
works for local files (Windows 7); I get multiple exchanges with a
List<List<String>>
for each line in the CSV file. But when I use that same route syntax with the SFTP component (connecting to a Linux server to download the files), I get a single exchange with a single line that reads like the output of "ls":
-rwxrwxrwx 1 userName userName 83400 Dec 16 14:11 fileName.csv
Through trial and error, I found that
split(body()).streaming().unmarshal().csv()
with the SFTP component will correctly load and parse the file, but it doesn't do it in streaming mode: it loads the entire file into memory before unmarshalling it into a single exchange.
I found a similar bug report (https://issues.apache.org/jira/browse/CAMEL-6231) from Camel 2.10, which Claus closed as Invalid, indicating the reporter was using threads and parallel processing with streaming incorrectly, but I'm not configuring either of those capabilities.
The SFTP stanza I'm using is:
sftp://192.168.1.1?fileName=fileName.csv&username=userName&password=secret!&idempotent=true&localWorkDirectory=tmp
The file stanza is:
"file:test/data?noop=true&fileName=fileName.csv"
Anyone have an idea of what I'm doing wrong?
Make an intermediate route to solve the problem; a sketch of a follow-on route that consumes the staged files appears after the route below.
<route id="StagingFtpFileCopy">
<from uri="ftp://{{uriFtpPath}}"/>
<to uri="file://data/staging"/>
</route>
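A sketch of the follow-on route that consumes the staged files and does the streaming split and CSV unmarshalling; the staging directory is carried over from the route above, and the log endpoint is just a placeholder:
<route id="StagingFileParse">
  <from uri="file://data/staging?delete=true"/>
  <split streaming="true">
    <tokenize token="\n"/>
    <unmarshal><csv/></unmarshal>
    <to uri="log:parsedCsvLine"/>
  </split>
</route>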
I faced the same issue with SFTP (Camel 2.25.0). However, before resorting to splitting the route into two different routes (as proposed by others), I used the URL below
sftp://:22/?username=random&password=random&delay=2000&move=archive&readLock=changed&bridgeErrorHandler=true&recursive=false&disconnect=true&stepwise=false&streamDownload=true&localWorkDirectory=C:/temp
with the route definition below:
from("sftp url").split().tokenize("\n", 10, true).streaming().to("log:out")
Since this route also downloads the remote file locally (same as the two-route option) and then treats the local file with normal streaming (as Sinsanator mentioned, that works perfectly with the file component), the memory footprint becomes a true saw-tooth while downloading (up to 100 MB) and then goes up to about 150 MB while processing, again in a roughly saw-tooth pattern.
One advantage (in my view) of this approach is that completion-related tasks (e.g. moving the remote file into another directory) can be driven by the actual completion of processing, which is not possible automatically if we break up the routes. Also, since the download is managed by Camel, the local file gets deleted automatically when processing completes.