Max date row using Talend Data Integration - JSON

Using Talend Data Integration, the following data is read in from a CSV file:
Time A B C
18:22.0 -0.30107087 3.79899955 6.52000999
18:23.5 -9.648633 0.84515321 1.50116444
18:25.0 -6.01184034 7.66982508 4.42568207
18:25.9 -9.22426033 3.12263751 5.10443783
18:26.7 -9.00698662 4.03901815 0.01316811
18:27.4 -4.31255579 6.25724602 5.02961922
18:28.2 -2.67013335 7.5932107 5.41628265
18:28.8 -1.76213241 6.26981592 7.44536877
18:29.5 -2.18590617 5.58567238 6.23928976
18:30.3 0.42078096 3.1429882 8.46290493
18:30.9 0.36391866 3.02926373 8.86752415
18:31.6 0.35673606 3.07176089 8.93396378
18:32.4 0.35374331 3.05081153 8.93994904
18:33.0 0.38187516 3.05799413 8.89745235
18:33.7 0.32920274 3.03644633 8.9315691
18:34.4 0.37529111 3.07475352 8.93575954
18:35.0 0.40342298 3.07654929 8.86393356
18:35.7 0.35254622 3.05260706 8.9267807
How do I extract only the max date row (18:35.7 0.35254622 3.05260706 8.9267807) and load it into a JSON?

You can achieve this by sorting your file on the Time column (a reverse alphanumeric sort puts the most recent Time on top), then taking the first row and writing it to a JSON file. Like so:
tSampleRow_1 config: set the Range to "1" so that only the first (most recent) row is kept.
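If you want to sanity-check the logic outside Talend, here is a minimal Python sketch of the same approach (the file name, the space delimiter, and the column names are assumptions based on the sample above):
import csv
import json

# Read the space-delimited sample; adjust the delimiter if your file uses tabs
with open("input.csv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter=" "))

# Reverse alphanumeric sort on Time puts the most recent timestamp first
# (safe here because all Time values share the same mm:ss.s format)
rows.sort(key=lambda r: r["Time"], reverse=True)

# Write only the first (most recent) row to a JSON file
with open("max_date_row.json", "w") as out:
    json.dump(rows[0], out, indent=2)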

Related

How to convert CSV to JSON using expression transformation in Informatica?

I have a CSV file that I am converting to JSON array format. Below are the row-wise operations in the expression transformation for the two fields:
region
country
Json(output port): '{'||'"region": '||'"'||Region||'",'||'"Country": '||'"'||Country||'"'||'},'
output:
{"region": "Australia and Oceania", "Country": "Tuvalu"},
This output is saved in a text file with session file properties as fixed width.
second mapping expression:
JSON(input)
V_JSON_start(variable port):INSTR(JSON,'{',1,1)
V_JSON_end(variable port):instr(JSON,'}',1,10)
O_Json(output port):'['||substr(JSON,V_JSON_start,V_JSON_end)||']'
output:
[{"region": "Australia and Oceania","Country": "Tuvalu"},
{"region": "Central America and the Caribbean","Country": "Grenada"}]
When I try to fetch the next 10 records as JSON, it pulls 20 records instead of ten.
Below is the expression:
JSON(input)
V_JSON_start(variable port):INSTR(JSON,'{',1,11)
V_JSON_end(variable port):instr(JSON,'}',1,20)
O_Json(output port):'['||substr(JSON,V_JSON_start,V_JSON_end)||']'
Kindly look into this and point out where I am going wrong.
Input: a flat file (CSV with two fields, region and country).
Expected output: 5 sessions, each producing 10 records in JSON format, e.g.
[{"region":"value","country":"value"},
{"region":"value","country":"value"}]
session1 (CSV to JSON) --> session2 -- session3 -- session4 -- session5 -- session6 (all parallel sessions using the file from the 1st session, 5 records in JSON format)
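For clarity about the intended output, here is a rough Python sketch (not Informatica) of that chunking: row-wise JSON objects grouped into arrays of 10 and written to one file per chunk. The file names and field names are hypothetical:
import csv
import json

CHUNK_SIZE = 10  # records per output JSON array

with open("regions.csv", newline="") as f:  # hypothetical input file
    rows = list(csv.DictReader(f))          # expects columns "region" and "country"

for i in range(0, len(rows), CHUNK_SIZE):
    chunk = [{"region": r["region"], "Country": r["country"]}
             for r in rows[i:i + CHUNK_SIZE]]
    with open(f"chunk_{i // CHUNK_SIZE + 1}.json", "w") as out:
        json.dump(chunk, out)  # e.g. [{"region": "...", "Country": "..."}, ...]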

Values from XML data field in MySQL

I would like to know if there is a query to select values from all of my XML data fields. There are around 1k rows which have XML data. All of them have almost the same data structure. With ExtractValue I was able to extract one data field, but at the point where more than one row is part of my subquery it breaks.
Here is an example of the XML data inside my DB:
<EDLXML version="1.0.0" type="variable">
<properties id="template_variables">
<deliveredDuration>4444</deliveredDuration>
<deliveredNum>1</deliveredNum>
<comment/>
<projectname>cdfkusen</projectname>
<name>kral_schalke_trenink</name>
<order_id>372846</order_id>
<cutlistId>2763_ID</cutlistId>
<bcutlistId>51ddgf7a6-1268-1gdfged-95e6-5254000e8e1a</bcutlistId>
<num>1</num>
<duration>177760</duration>
<quotaRelevantDuration>0</quotaRelevantDuration>
<organisationUid>OrgName</organisationUid>
<organisationQuota>333221233</organisationQuota>
<organisationUsedQuota>123</organisationUsedQuota>
<organisationContingentIrrelevantQuotaUsed>54</organisationContingentIrrelevantQuotaUsed>
<userDbId>7xxxx84-eb9b-11fdsb-9ddd1-52cccccde1a</userDbId>
<userId>xxxx</userId>
<userRights>RH_DBC</userRights>
<firstName>DThom</firstName>
<lastName>Test</lastName>
<userMail>xxx#ccc.cz</userMail>
<language>English</language>
<orderTimestamp>1659448080</orderTimestamp>
<stitching>false</stitching>
<transcode>NO</transcode>
<destination>Standard</destination>
<collaboration>private</collaboration>
<premiumUser>false</premiumUser>
<priority>normal</priority>
<userMail2>xxx#ccc.cz</userMail2>
<cutlistItems>
<cutListId>125124_KFC</cutListId>
<cutListItemId cutlistItemDeliveryStatus="&#10004" cutlistItemDStatusMessage="delivered">112799</cutListItemId>
<bmarkerId>8f16ff80-1269-11ed-95e6-5254000e8e1a</bmarkerId>
<videoId>2912799</videoId>
<counter>1</counter>
<frameInSpecified>true</frameInSpecified>
<frameIn>15638</frameIn>
<frameOutSpecified>true</frameOutSpecified>
<frameOut>20082</frameOut>
<tcIn>00:10:25:13</tcIn>
<tcOut>00:13:23:07</tcOut>
<duration>177760</duration>
<BroadcastDate>2021-07-24</BroadcastDate>
<eventDate>2021-07-24</eventDate>
<resolutionFacet>HD</resolutionFacet>
<provider>DBC</provider>
<technicalrightholders>RH_DBC</technicalrightholders>
<rights>DBC</rights>
<materialType>DP</materialType>
<targetFilename>kral_schalke_trenink</targetFilename>
</cutlistItems>
</properties>
</EDLXML>
I get the right value from the query if I do:
SELECT ExtractValue((SELECT job_xml from cutlist where job_xml is not null LIMIT 1), '//deliveredNum');
But when I change the LIMIT amount I get back: Subquery returns more than 1 row.
ExtractValue expects two string arguments. When your subquery returns more than one row, you are no longer passing a string as the first argument (you are passing a set of results).
Instead of calling ExtractValue once for your entire query, call it once for every row, like:
SELECT ExtractValue(job_xml, '//deliveredNum')
FROM cutlist
WHERE job_xml IS NOT NULL
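If you ever need several fields per row, the same per-row idea also works client-side; here is a rough Python sketch using ElementTree instead of ExtractValue (the connection details are placeholders, and the table/column names follow the question):
import xml.etree.ElementTree as ET
import pymysql  # assumption: any MySQL client library would do here

conn = pymysql.connect(host="localhost", user="user", password="pass", database="db")
with conn.cursor() as cur:
    cur.execute("SELECT job_xml FROM cutlist WHERE job_xml IS NOT NULL")
    for (job_xml,) in cur.fetchall():
        root = ET.fromstring(job_xml)
        # findtext walks the tree much like the '//deliveredNum' XPath in ExtractValue
        print(root.findtext(".//deliveredNum"), root.findtext(".//name"))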

Do not show JSON data in columns

The software I'm using saves a copy of the data, which I think is JSON, in a separate table whenever I create records in the database.
What I want to do is to be able to query the JSON data contained in the DATASETS column separately.
I'm using SQL Server 2012.
This is the query I tried so far:
SELECT TOP 1 IND, SNAPSHOTDATE, DATASETS, USERNAME, OWNERFORM
FROM TBLSNAPSHOTS
CODE RESULT:
105 2018-09-14 02:59:34.000 { "Datasets": [{"Name": "TBLSTOKLAR","Lines": [{"IND": "102","STOKNO": "","MALINCINSI": "TITIZ PLASTIK BUYUK KASIK 10 ADET","STOKKODU": "8691262708050","ANABIRIM": "102","BIRIMEX": "102","ALTSEVIYE": "","KRITIKSEVIYE": "","USTSEVIYE": "","DEPOSEVIYESI": "True","URETICI": "","AYLIKVADE": "0","SERINO": "","DEPO": "1","STOKGRUBU": "","GARANTI": "0","PRIM": "0","IPTAL": "False","STOKTIPI": "0","STOKTAKIP": "0","TEMINYERI": "1","RAFOMRU": "0","RESIM": "","KALAN": "0","REZERV": "0","KOD1": "","KOD2": "","KOD3": "","KOD4": "","KOD5": "","KOD6": "","KOD7": "","KOD8": "","KOD9": "","KOD10": "","TAKSITSAYISI": "0","ISTIHBARAT": "","FIYATYOK": "","DELETED": "","ALISFIYATI": "0","ESKIALISFIYATI": "0","SONALISTARIHI": "","SONSATISTARIHI": "","KARTINACILMATARIHI": "14.09.18 ı. 02:57:58","DEVIRIND": "","MALIYET": "1","KDVGRUBU": "1","AKTIF": "False","ISCILIKIND": "0","ISCILIKBIRIMIND": "0","ISCILIKACIKLAMA": "","ISCILIKSTOKKODU": "","ALISFIYATIDEGISMETARIHI": "","STATUS": "1","DALISFIYATI": "","APB": "","OIV": "0","KARORANI": "0","OTV": "0","ISK": "0","STOKGRUPTANIMI": "","ISKSATISFIYATI2": "0","ISKSATISFIYATI3": "0","ALISKDVORANI": "18","ALISISKORANI": "","SIPARISALINMASIN": "False","SIPARISVERILMESIN": "False","P1": "","P2": "","P3": "","SATISKOSULU": "","DEFAULTALISFIYATI": "","DEFAULTALISFIYATIDEGISMESTARIHI": "","KDVGRUBUT": "","HEDEFSATISFIAYTI": "","KURUMISKONTOSU": "","TICARIISKONTO": "","ITSBILDIRIMI": "False","MAXISKORANI": "","IMALATCISATISFIYATI": "","DKUR": "1","ACILSEVK": "False","SOGUKSEVK": "False","ICMIKTAR": "","TICARISEKIL": "","MAXISKTUTAR": "","TAXE": "","KOD11": "","DAPB": "","IKINCIEL": "","ETICARET": "","STOKNEVI": "0","OTVORANSAL": "True","POZ": "","YAZARKASA": "False","KOD12": "","KOD13": "","KOD14": "","KOD15": "","KOD16": "","KOD17": "","KOD18": "","KOD19": "","KOD20": "","KOD21": "","UID": "{0DE71D73-E447-45B0-BF6A-1D312DBAFDD2}"}]}]} ADMIN frmEdtStok```
In SQL Server 2012 - no, you can't directly query the JSON. In SQL Server 2016 they added functions that let you do this:
https://learn.microsoft.com/en-us/sql/t-sql/functions/json-query-transact-sql?view=sql-server-2017
But if you need to stay on 2012 you are limited to string-parsing it (don't do this), or writing/finding a CLR function which parses it using .NET code and returns the results.
If you simply must do it quickly, there are some hacky solutions to parse it, like so: https://www.red-gate.com/simple-talk/sql/t-sql-programming/consuming-json-strings-in-sql-server/ but don't expect it to work smoothly with complex JSON.
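If the parsing can happen outside SQL Server, doing it client-side is far less painful than string-slicing in T-SQL. A minimal Python sketch, assuming you have already fetched the DATASETS value as a string:
import json

# datasets_str stands in for the raw DATASETS column value returned by your query
datasets_str = '{"Datasets": [{"Name": "TBLSTOKLAR", "Lines": [{"IND": "102", "STOKKODU": "8691262708050"}]}]}'

doc = json.loads(datasets_str)
for dataset in doc["Datasets"]:
    for line in dataset["Lines"]:
        print(dataset["Name"], line["IND"], line.get("STOKKODU"))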

How to omit the header when using Spark to read a CSV file?

I am trying to use Spark to read a CSV file in a Jupyter notebook. So far I have:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
reviews_df = spark.read.option("header","true").csv("small.csv")
reviews_df.collect()
This is what reviews_df looks like:
[Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5'),
Row(reviewerID=u'A2YB0B3QOHEFR', asin=u'B000JJSRNY', overall=u'5'),
Row(reviewerID=u'AAI0092FR8V1W', asin=u'B0060MYKYY', overall=u'5'),
Row(reviewerID=u'A2TAPSNKK9AFSQ', asin=u'6303187218', overall=u'5'),
Row(reviewerID=u'A316JR2TQLQT5F', asin=u'6305364206', overall=u'5')...]
But each row of the data frame contains the column names. How can I reformat the data so that it becomes:
[(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'),
(u'A2YB0B3QOHEFR', u'B000JJSRNY', u'5')....]
A DataFrame always returns Row objects; that's why when you issue collect() on the DataFrame, it shows:
Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5')
To get what you want, you can do:
reviews_df.rdd.map(lambda row: (row.reviewerID, row.asin, row.overall)).collect()
This will return tuples of the row values.
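Since Row is a subclass of tuple in PySpark, a shorter equivalent (a sketch, not part of the original answer) that keeps the values in column order without naming each field is:
reviews_df.rdd.map(tuple).collect()
# [(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'), ...]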

Feeding a DataFrame created from a CSV to MLlib KMeans: IndexError: list index out of range

Because I cannot use spark-csv, I have manually created a DataFrame from the CSV as follows:
raw_data = sc.textFile("data/ALS.csv").cache()
csv_data = raw_data.map(lambda l: l.split(","))
header = csv_data.first()
csv_data = csv_data.filter(lambda line: line != header)
row_data = csv_data.map(lambda p: Row(
    location_history_id=p[0],
    user_id=p[1],
    latitude=p[2],
    longitude=p[3],
    address=p[4],
    created_at=p[5],
    valid_until=p[6],
    timezone_offset_secs=p[7],
    opening_times_id=p[8],
    timezone_id=p[9]))
location_df = sqlContext.createDataFrame(row_data)
location_df.registerTempTable("locations")
I need only two columns:
lati_longi_df=sqlContext.sql("""SELECT latitude, longitude FROM locations""")
rdd_lati_longi = lati_longi_df.map(lambda data: Vectors.dense([float(c) for c in data]))
rdd_lati_longi.take(2):
[DenseVector([-6.2416, 106.7949]),
DenseVector([-6.2443, 106.7956])]
Now it seems that everything is ready for KMeans training:
clusters = KMeans.train(rdd_lati_longi, 10, maxIterations=30,
                        runs=10, initializationMode="random")
But I get the following error:
IndexError: list index out of range
First three lines of ALS.csv:
location_history_id,user_id,latitude,longitude,address,created_at,valid_until,timezone_offset_secs,opening_times_id,timezone_id
Why don't you let Spark parse the CSV instead? You can enable CSV support with something like this:
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
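For reference, a rough sketch of the whole flow once spark-csv is available (Spark 1.x style, matching the sqlContext and KMeans.train calls in the question; column names come from the header shown above):
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# Let spark-csv handle quoting and commas inside fields such as address;
# splitting on "," by hand is a common cause of the IndexError above.
location_df = (sqlContext.read
               .format("com.databricks.spark.csv")
               .options(header="true", inferSchema="true")
               .load("data/ALS.csv"))

rdd_lati_longi = (location_df
                  .select("latitude", "longitude")
                  .rdd
                  .map(lambda row: Vectors.dense([float(row.latitude), float(row.longitude)])))

clusters = KMeans.train(rdd_lati_longi, 10, maxIterations=30,
                        runs=10, initializationMode="random")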