Issue in running a Gobblin Job - json

I am new to Gobblin and I am trying to run a simple job in standalone mode, but I am getting the following error:
Task failed due to "com.google.gson.JsonSyntaxException:
com.google.gson.stream.MalformedJsonException: Expected name at line 1
column 72 path $.fields."
My job file is:
job.name=MRJob1
job.group=MapReduce
job.description=A getting started job for MapReduce
source.class=gobblin.source.extractor.filebased.TextFileBasedSource
source.filebased.downloader.class=gobblin.source.extractor.filebased.CsvFileDownloader
converter.classes=gobblin.converter.csv.CsvToJsonConverterV2,gobblin.converter.avro.JsonIntermediateToAvroConverter
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
source.filebased.fs.uri=file:///
source.filebased.data.directory=/home/sahil97/Downloads/gobblin-dist/CsvSource/
source.schema=[{"ColumnName":"FIRST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}}{"ColumnName":"LAST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"GENDER","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"AGE","comment": "","isNullable": "true","dataType":{"type":"int"}}]
source.skip.first.record=true
source.csv_file.delimiter=,
converter.csv.to.json.delimiter=,
extract.table.type=append_only
extract.table.name=CsvToAvro
extract.namespace=MapReduce
converter.classes=gobblin.converter.csv.CsvToJsonConverterV2,gobblin.converter.avro.JsonIntermediateToAvroConverter
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
My CSV file (Repo.txt) is:
FIRST_NAME,LAST_NAME,GENDER,AGE
Sahil,Gaur,Male,22
Sagar,Gaur,Male,21
Dushyant,Saini,Male,23
Devyani,Kaulas,Female,21
Sanchi,Theraja,Female,22
Shreya,Gupta,Female,21
Chirag,Thakur,Male,22
Manish,Sharma,Male,23
Abhishek,Soni,Male,24
Varnita,Sachdeva,Female,22
Deepam,Chaurishi,Male,22

The error says you have bad JSON syntax, so the source.schema property is probably the first place to look.
If this is the actual JSON, you are missing a comma between the first two field objects:
{
  "ColumnName": "FIRST_NAME",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "string"
  }
}, // <-- this comma is missing in your source.schema
{
  "ColumnName": "LAST_NAME",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "string"
  }
},
{
  "ColumnName": "GENDER",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "string"
  }
},
{
  "ColumnName": "AGE",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "int"
  }
}
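With that comma added, the source.schema line in the job file parses cleanly. For reference, this is the exact value from the question with only the missing comma inserted:
source.schema=[{"ColumnName":"FIRST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"LAST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"GENDER","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"AGE","comment": "","isNullable": "true","dataType":{"type":"int"}}]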

Related

Json schema field order

I know that fields listed in a JSON schema object have no defined order, since they are not an array, but I am looking for a way to display them in the proper order in my application UI.
Workarounds I have found so far include using a different serializer, or even hard-coding a number into the field name.
I would like to come up with something that works with my current setup: Hibernate, Spring Boot, and a React front end.
Given this GET request:
/profile/personEntities
with header: Accept: application/schema+json
I will receive this:
{
  "title": "Person entity",
  "properties": {
    "birthday": {
      "title": "Birthday",
      "readOnly": false,
      "type": "string",
      "format": "date-time"
    },
    "lastName": {
      "title": "Last name",
      "readOnly": false,
      "type": "string"
    },
    "address": {
      "title": "Address",
      "readOnly": false,
      "type": "string",
      "format": "uri"
    },
    "firstName": {
      "title": "First name",
      "readOnly": false,
      "type": "string"
    },
    "email": {
      "title": "Email",
      "readOnly": false,
      "type": "string"
    },
    "cellPhone": {
      "title": "Cell phone",
      "readOnly": false,
      "type": "string"
    }
  },
  "requiredProperties": [
    "firstName",
    "lastName"
  ],
  "definitions": {},
  "type": "object",
  "$schema": "http://json-schema.org/draft-04/schema#"
}
I have tried adding @JsonProperty(index=2) to the field, but nothing changes.
Thank you very much for any tips.
If you're using Jackson to handle your serialization/deserialization you can use @JsonPropertyOrder - from their docs:
// ensure that "id" and "name" are output before other properties
@JsonPropertyOrder({ "id", "name" })
// order any properties that don't have explicit setting using alphabetic order
@JsonPropertyOrder(alphabetic=true)
See: http://fasterxml.github.io/jackson-annotations/javadoc/2.3.0/com/fasterxml/jackson/annotation/JsonPropertyOrder.html

Avro Schema format Exception - “SecurityClassification” is not a defined name

I'm trying to use this avro schema
{
  "type": "record",
  "name": "ComplianceEntity",
  "namespace": "com.linkedin.events.metadata",
  "fields": [
    {
      "name": "fieldPath",
      "type": "string"
    },
    {
      "name": "complianceDataType",
      "type": {
        "type": "enum",
        "name": "ComplianceDataType",
        "symbols": [
          "NONE",
          "MEMBER_ID"
        ],
        "symbolDocs": {
          "NONE": "None of the following types apply",
          "MEMBER_ID": "ID for LinkedIn members"
        }
      }
    },
    {
      "name": "complianceDataTypeUrn",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "fieldFormat",
      "type": [
        "null",
        {
          "type": "enum",
          "name": "FieldFormat",
          "symbols": [
            "NUMERIC"
          ],
          "symbolDocs": {
            "NUMERIC": "Numerical format, 12345"
          },
          "doc": "The field format"
        }
      ]
    },
    {
      "name": "securityClassification",
      "type": "SecurityClassification"
    },
    {
      "name": "valuePattern",
      "default": null,
      "type": [
        "null",
        "string"
      ]
    }
  ]
}
To generate an Avro file using avro-tools:
java -jar ./avro-tools-1.8.2.jar compile schema ComplianceEntity.avsc .
But I am getting the following error message:
Exception in thread "main" org.apache.avro.SchemaParseException: "SecurityClassification" is not a defined name. The type of the "securityClassification" field must be a defined name or a {"type": ...} expression.
Could anyone tell me why SecurityClassification is not identified as a defined name?
You are using it as the type of your field, but you never define it the way you define ComplianceDataType; that is why you are getting the Avro exception:
{
  "name": "securityClassification",
  "type": "SecurityClassification"
}
Make sure that if you have more than one schema, you pass all of them, especially dependency schemas. This has been supported since Avro 1.5.3: https://issues.apache.org/jira/browse/AVRO-877.
java -jar ./avro-tools-1.8.2.jar compile schema SecurityClassification.avsc ComplianceEntity.avsc .
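For illustration, a minimal SecurityClassification.avsc could look like the enum below. The question does not show the real definition, so the symbols here are placeholders; what matters for resolving the reference in ComplianceEntity.avsc is the name and namespace:
{
  "type": "enum",
  "name": "SecurityClassification",
  "namespace": "com.linkedin.events.metadata",
  "doc": "Placeholder definition; replace the symbols with the real classification values.",
  "symbols": [
    "CONFIDENTIAL",
    "UNCLASSIFIED"
  ]
}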

js-beautify config for Arrays of Objects

Is there any way to configure js-beautify to keep the following format:
"structure": [
{
"name": "heading",
"text": "",
"default": "",
"type": "string"
},
{
"name": "flickr-album-id",
"text": "",
"type": "string"
},
]
js-beautify pulls the curly bracket of the first object up onto the first line. I know of the option keep_array_indentation, but I don't want to disable the general auto-indentation, because errors won't get fixed otherwise.
The result of js-beautify, which I want to prevent, is:
"structure": [{
"name": "heading",
"text": "",
"default": "",
"type": "string"
},
{
"name": "flickr-album-id",
"text": "",
"type": "string"
},
]

Error adding docs to CouchDB

My local PouchDB docs won't replicate to my remote CouchDB.
The sync is happening (the browser downloaded my design document), so it's not a permissions issue. I think that my design schema doesn't match my documents, but I have struggled to find the correct way to write the schema.
Schema
{
  "_id": "_design/schema",
  "_rev": "4-3d9a49ebffbbd6b7b146240879baa7e4",
  "validate_doc_update": "function(newDoc, oldDoc, userCtx, secObj){ if(userCtx.roles[0] !== 'admin'){throw({forbidden: 'operation forbidden'})} }",
  "views": {
    "by_module": {
      "map": "function(doc){ if(doc.type == 'note'){emit(doc.note);} }"
    }
  },
  "schema": {
    "title": "Contact details",
    "description": "A document containing a person's contact details.",
    "type": "object",
    "required": [
      "name",
      "level"
    ],
    "properties": {
      "_id": {
        "type": "string"
      },
      "_rev": {
        "type": "string"
      },
      "application_access": {
        "type": "string"
      },
      "home": {
        "type": "string"
      },
      "home_email": {
        "type": "string"
      },
      "jobtitle": {
        "type": "string"
      },
      "level": {
        "type": "string"
      },
      "mobile1": {
        "type": "string"
      },
      "mobile2": {
        "type": "string"
      },
      "modified": {
        "type": "number"
      },
      "name": {
        "type": "string"
      },
      "work": {
        "type": "string"
      },
      "work_email": {
        "type": "string"
      },
      "_doc_id_rev": {
        "type": "string"
      }
    }
  }
}
Doc
{
  "_id": "fcb52b3072e2038647b328c0a700147f",
  "_rev": "1518449239461",
  "application_access": "User",
  "home": "",
  "home_email": "",
  "jobtitle": "Exemplar",
  "level": "Bronze",
  "mobile1": "0987654321",
  "mobile2": "",
  "modified": 1518449239461,
  "name": "Zachary Zumbeispiel",
  "work": "",
  "work_email": "",
  "_doc_id_rev": "1518449239460::1-ffd3c056614845ada4a68de4793710ac"
}
So the question is, does my doc conform to my schema? Or is my schema wrong?
The "schema" was on the CouchDB instance, and the doc was in PouchDB.
They are replicating now, because I removed the "document schema", so everything can sync and replicate.
The problem was that I had read somewhere that I needed to add a schema to CouchDB, so I created what I thought was a schema; evidently the structure of the docs to be synced has to match it, and the sync failed because the structure of my docs did not match.
Hence @Flimzy's explanation that the whole notion of a schema was bobbins meant that I removed the "schema", and voilà, PouchDB and CouchDB can now sync. Problem solved.
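For reference, removing the "schema" member reduces the design document to just the validation function and the view. This is the original design document from the question minus that member (CouchDB assigns a fresh _rev on save):
{
  "_id": "_design/schema",
  "validate_doc_update": "function(newDoc, oldDoc, userCtx, secObj){ if(userCtx.roles[0] !== 'admin'){throw({forbidden: 'operation forbidden'})} }",
  "views": {
    "by_module": {
      "map": "function(doc){ if(doc.type == 'note'){emit(doc.note);} }"
    }
  }
}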

OrientDB ETL from CSV DateTime

This is currently my config file:
{
  "config": {
    "haltOnError": false
  },
  "source": {
    "file": {
      "path": "/home/user1/temp/real_user/user3.csv"
    }
  },
  "extractor": {
    "csv": {
      "columns": ["id", "name", "token", "username", "password", "created", "updated", "enabled", "is_admin", "is_banned", "userAvatar"],
      "columnsOnFirstLine": true
    },
    "field": {
      "fieldName": "created",
      "expression": "created.asDateTime()"
    }
  },
  "transformers": [{
    "vertex": {
      "class": "user"
    }
  }],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/user1/orientdb/real_user",
      "dbAutoCreateProperties": true,
      "dbType": "graph",
      "classes": [{
        "name": "user",
        "extends": "V"
      }],
      "indexes": [{
        "class": "user",
        "fields": ["id:long"],
        "type": "UNIQUE"
      }]
    }
  }
}
and my CSV currently looks like this:
6,Olivia Ong,2jkjkl54k5jklj5k4j5k4jkkkjjkj,\N,\N,2013-11-15 16:36:33,2013-11-15 16:36:33,1,0,\N,\N
7,Matthew,32kj4h3kjh44hjk3hk43hkhhkjhasd,\N,\N,2013-11-18 17:29:13,2013-11-15 16:36:33,1,0,\N,\N
Still, when I execute the ETL, OrientDB won't recognize my datetime columns as datetimes.
I tried putting the data type in the column definitions ("created:datetime"), but then it ended up not showing any data.
I wonder what the proper solution is for this case.
From the next version, 2.2.8, you will be able to define different default patterns for date and datetime: see the CSV extractor documentation.
Note that when you define the columns, you need to specify the column's type:
"columns": ["id:string", "created:date", "updated:datetime"],
You can use the 2.2.8-SNAPSHOT jar of the ETL module with 2.2.7 without any problem:
https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-etl/2.2.8-SNAPSHOT/
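Applied to the question's config, the extractor section would then look roughly like this sketch. The point is the typed column list: created:datetime and updated:datetime follow the answer above, id:long matches the index definition already in the config, and the remaining types are guesses from the sample rows:
"extractor": {
  "csv": {
    "columns": ["id:long", "name:string", "token:string", "username:string", "password:string", "created:datetime", "updated:datetime", "enabled:integer", "is_admin:integer", "is_banned:integer", "userAvatar:string"],
    "columnsOnFirstLine": true
  }
}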