Logstash parse multiline CSV file - csv

I have a CSV file in which some fields contain "\n" (newlines). These fields are quoted, so the file displays properly in Excel and parses correctly with pandas in Python. However, the CSV filter in Logstash doesn't handle it properly and produces either a CSV parse error or wrong fields. Has anyone run into this before?
I also saw this issue on github: https://github.com/logstash-plugins/logstash-filter-csv/issues/34 but it's a year old.

Have you tried the multiline codec?
You should add something like this to your input plugin:
codec => multiline {
  pattern => "^[0-9]"
  negate => "true"
  what => "previous"
}
It tells Logstash that every line not starting with a number should be merged with the previous line.
See also: Loading csv in ElasticSearch using logstash
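For what it's worth, here is a rough Python sketch (made-up sample data, not Logstash code) of the merge that this codec performs before the csv filter sees each event:

import csv
import io
import re

# Made-up sample: the second record has a quoted field with an embedded newline.
raw = 'id,comment\n1,"first line\ncontinued"\n2,"plain"\n'

# Mimic the codec: a line that does NOT match "^[0-9]" (negate => "true")
# gets appended to the previous line ("what => previous").
merged = []
for line in raw.splitlines():
    if re.match(r"^[0-9]", line) or not merged:
        merged.append(line)
    else:
        merged[-1] += "\n" + line

# Each merged entry is now one logical CSV record that a CSV parser can handle.
for record in csv.reader(io.StringIO("\n".join(merged))):
    print(record)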

Related

Error when importing GeoJson into BigQuery

I'm trying to load GeoJson data [1] into BigQuery via Cloud Shell but I'm getting the following error:
Failed to parse JSON: Top-level GeoJson 'type' member should have value 'Feature', but was 'FeatureCollection'.; ParsedString returned false; Could not parse value; Parser terminated before end of string
It feels like the GeoJson file is not formatted properly for BQ but I have no idea if that's true or how to fix it.
[1] https://github.com/tonywr71/GeoJson-Data/blob/master/australian-suburbs.geojson
Expounding on @scespinoza's answer, I was able to convert to newline-delimited GeoJSON and load it into BigQuery with the following steps:
geojson2ndjson geodata.txt > geodata_converted.txt
Using this command I initially ran into an error, but I was able to work around it by splitting the data into two tables and applying the same command to each. The tables then loaded into BigQuery successfully.
Your file is in standard GeoJSON format, but BigQuery only accepts newline-delimited GeoJSON files and individual GeoJSON objects (see documentation: https://cloud.google.com/bigquery/docs/geospatial-data#geojson-files). So, you should first convert the dataset to the appropriate format. Here is a good and simple explanation of how it works: https://stevage.github.io/ndgeojson/.
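If you prefer to script the conversion yourself instead of using geojson2ndjson, a Python sketch along these lines should produce the newline-delimited file (file names are just examples, and I'm assuming a standard FeatureCollection as input):

import json

# Load the standard GeoJSON FeatureCollection (example file name).
with open("australian-suburbs.geojson") as src:
    collection = json.load(src)

# Write newline-delimited GeoJSON: one Feature object per line, as BigQuery expects.
with open("australian-suburbs.ndjson", "w") as dst:
    for feature in collection["features"]:
        dst.write(json.dumps(feature) + "\n")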

AHK CSV Parse can't Parse line by line

Using AutoHotkey, I'm looking at the AHK documentation.
The file I want to read is a CSV, so I'm testing Example 4.
This is my CSV file: 2 rows, several columns.
So if I open the file, the data should be read comma by comma and line by line, right?
But that's not the answer that gets printed.
Why is this happening?
Why doesn't AHK's CSV parse split the data line by line?
I need some help :<
P.S.: The code is the same as Example 4.
Loop, Parse, PositionData, CSV
{
    MsgBox, 4, , Field %LineNumber%-%A_Index% is:`n%A_LoopField%`n`nContinue?
    IfMsgBox, No, break
}
I found it. When reading a CSV in AHK, you must read the file line by line first, then parse each line as CSV...
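To show the structure of that fix without the AHK syntax, here is the same two-level idea sketched in Python (sample data is made up): an outer loop over lines, then an inner loop over the CSV fields of each line.

import csv

# Made-up data standing in for PositionData: 2 rows, several columns.
position_data = "10,20,30\n40,50,60"

# Read line by line first, then parse each line as CSV.
for line_number, line in enumerate(position_data.splitlines(), start=1):
    for field_number, field in enumerate(next(csv.reader([line])), start=1):
        print(f"Field {line_number}-{field_number} is: {field}")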

How to read a csv in pyspark using error_bad_line = False as we use in pandas

I am trying to read a CSV into PySpark, but the problem is that it has a text column, because of which there are some bad lines in the data.
This text column also contains newline characters, due to which the data in the following columns gets corrupted.
I have tried using pandas with some extra parameters to load my CSV:
a = pd.read_csv("Mycsvname.csv",sep = '~',quoting=csv.QUOTE_NONE, dtype = str,error_bad_lines=False, quotechar='~', lineterminator='\n' )
It works fine in pandas, but I want to load the CSV in PySpark.
So, is there a similar way to load a CSV in PySpark with all the above parameters?
In current versions of Spark (I think even from Spark 2.2 onwards), you can also read multi-line records from CSV.
If the newline is your only problem with the text column, you can use a read command like this:
spark.read.csv("YOUR_FILE_NAME", header="true", escape="\"", quote="\"", multiLine=True)
Note: in our case the escape and quotation characters were both ", so you might want to edit those options to use your ~ and include sep = '~'.
You can also look at the documentation (http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=csv#pyspark.sql.DataFrameReader.csv) for more details
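To adapt that answer to the file in the question, the call might look roughly like this (untested sketch; the quote and escape characters may need further changes depending on how the text column is actually quoted in your file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# '~' as the separator, multiLine=True so newlines inside quoted text
# stay within a single record. Adjust quote/escape to match your file.
df = spark.read.csv(
    "Mycsvname.csv",
    sep="~",
    header=True,
    quote="\"",
    escape="\"",
    multiLine=True,
)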

Invalid byte sequence in UTF-8, CSV import, Rails 4

I have a rake task that populates my database from a CSV file:
require 'csv'

namespace :import_data_csv do
  desc "Import teams from csv file"
  task import_data: :environment do
    # `file` holds the path to the CSV (defined elsewhere in the task)
    CSV.foreach(file, :headers => true) do |row|
      # various import tasks
    end
  end
end
This had been working properly, but with a new CSV file, I'm getting the following error on the 6th row of the CSV file:
Invalid byte sequence in UTF-8
I have looked through the row and can't seem to find any irregular characters.
I've also attempted a couple other fixes recommended on stackoverflow:
- Changing the CSV.foreach to:
reader = CSV.open(file, "r")
reader.each do |row|
And changing:
CSV.foreach(file, :headers => true) do |row|
to:
CSV.foreach(file, encoding: "r:ISO-8859-1", :headers => true) do |row|
None of these seem to correct the issue.
Suggestions?
I solved this by saving the file as an MS-DOS CSV (Excel's "CSV (MS-DOS)" format), instead of the standard CSV or Windows CSV format.
The answer for me was to take the CSV file and save it as a text file, then replace the tabs with commas, then save the file as UTF-8 encoded. Finally, change the .txt extension to .csv and check that it looks correct in Excel, BUT DON'T save it in Excel; just close it once you see it looks right. Then upload it.
A long, non-programmatic solution, but for my purposes it's sufficient.
Source is here: https://help.salesforce.com/apex/HTViewSolution?id=000003837&language=en_US
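If you'd rather avoid the manual Excel round-trip, the re-encoding can also be scripted. Here is a rough one-off sketch (in Python, run outside the Rails app), assuming the bad bytes come from a Windows-1252/ISO-8859-1 export and using example file names:

# Read the raw bytes and rewrite them as UTF-8.
with open("teams.csv", "rb") as src:
    raw = src.read()

with open("teams_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(raw.decode("windows-1252"))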

Sqoop HDFS to Couchbase: json file format

I'm trying to export data from HDFS to Couchbase and I have a problem with my file format.
My configuration:
- Couchbase Server 2.0
- Hadoop stack: CDH 4.1.2
- Sqoop 1.4.2 (compiled with Hadoop 2.0.0)
- Couchbase/Hadoop connector (compiled with Hadoop 2.0.0)
When I run the export command, I can easily export files with this kind of format:
id,"value"
or
id,42
or
id,{"key":"value"}
But when I want to export a JSON object with several keys, it doesn't work!
id,{"key1":"value1","key2":"value2"}
The content is truncated at the first comma and displayed in base64 by Couchbase, because the stored content is no longer valid JSON...
So, my question is: how must the file be formatted to be stored as a JSON document?
Can we only export key/value files?
I want to export JSON files from HDFS the way cbdocloader does with files from the local file system...
I'm afraid this is expected behavior, as Sqoop is parsing your input file as CSV with a comma as the separator. You might need to tweak your file format to either escape the separator or enclose the entire JSON string. I would recommend reading how exactly Sqoop deals with escaping separators and enclosing strings in the user guide [1].
Links:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#id387098
I think your best bet is to convert the files to tab-delimited, if you're still working on this. If you look at the Sqoop documentation (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_large_objects), there's an option --fields-terminated-by which allows you to specify which character Sqoop splits fields on.
If you passed it --fields-terminated-by '\t' and a tab-delimited file, it would leave the commas in place in your JSON, as sketched below.
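A rough sketch of that conversion (file names are just examples): split each exported line on the first comma only, so the commas inside the JSON value are preserved, then point Sqoop at the tab-delimited file with --fields-terminated-by '\t'.

# Convert "key,{...json...}" lines to "key<TAB>{...json...}".
with open("export.csv") as src, open("export.tsv", "w") as dst:
    for line in src:
        key, value = line.rstrip("\n").split(",", 1)
        dst.write(key + "\t" + value + "\n")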
@mpiffaretti, can you post your sqoop export command? I think each JSON object should have its own key value:
key1 {"dataOne":"ValueOne"}
key2 {"dataTwo":"ValueTwo"}
http://ajanacs.weebly.com/blog
In your case, changing the data like below may help you solve the issue:
id,{"key":"value"}
id2,{"key2":"value2"}
Let me know if you have further questions on it.