I have 100 unstructured CSV files and need to load their data into a single VARIANT column. The code posted below creates two rows when two rows are present in a file, but the requirement is to create a single row that stores the data from both rows. What change can I make to the code?
The table stores the data in the DATA column:
CREATE OR REPLACE TABLE rtf_lines
(
LOADED_AT timestamp,
FILENAME string,
FILE_ROW_NUMBER int,
DATA VARIANT
);
Copy the data into the table. The JSON object supports up to 20 CSV columns and can be extended:
COPY INTO rtf_lines
from
(
SELECT
CURRENT_TIMESTAMP as LOADED_AT,
METADATA$FILENAME as FILENAME,
METADATA$FILE_ROW_NUMBER as FILE_ROW_NUMBER,
object_construct(
'col_001', T.$1, 'col_002', T.$2, 'col_003', T.$3, 'col_004', T.$4,
'col_005', T.$5, 'col_006', T.$6, 'col_007', T.$7, 'col_008', T.$8,
'col_009', T.$9, 'col_010', T.$10, 'col_011', T.$11, 'col_012', T.$12,
'col_013', T.$13, 'col_014', T.$14, 'col_015', T.$15, 'col_016', T.$16,
'col_017', T.$17, 'col_018', T.$18, 'col_019', T.$19, 'col_020', T.$20
) as data
FROM @%rtf_lines T
)
FILE_FORMAT =
(
TYPE = CSV
RECORD_DELIMITER = '\n'
ESCAPE_UNENCLOSED_FIELD = NONE
FIELD_OPTIONALLY_ENCLOSED_BY='0x22'
EMPTY_FIELD_AS_NULL=FALSE
);
The code currently outputs:
Row 1
LOADED_AT 2022-06-02 06:09:57.363
FILENAME #RTF_LINES/ui1654167360506/rtf_snowflake_sample.csv
FILE_ROW_NUMBER 1
DATA { "col_001": "NDTV.com provides latest news", "col_002": " videos from India and the world. Get today’s news headlines from Business", "col_003": " " }
Row 2
LOADED_AT 2022-06-02 06:09:57.363
FILENAME #RTF_LINES/ui1654167360506/rtf_snowflake_sample.csv
FILE_ROW_NUMBER 2
DATA { "col_001": "Technology", "col_002": " Sports", "col_003": " Movies", "col_004": " videos", "col_005": " photos", "col_006": " live news coverage and exclusive breaking news from India.}
The expected output is:
Row 1
LOADED_AT 2022-06-02 06:09:57.363
FILENAME #RTF_LINES/ui1654167360506/rtf_snowflake_sample.csv
FILE_ROW_NUMBER 1
DATA { "col_001": "NDTV.com provides latest news", "col_002": " videos from India and the world. Get today’s news headlines from Business", "col_003": " ",
"col_004": "Technology", "col_005": " Sports", "col_006": " Movies", "col_007": " videos", "col_008": " photos", "col_009": " live news coverage and exclusive breaking news from India.}
I use requests to pull JSON files of companies. How do I add a ticker column and the JSON string to a CSV file (separated by a comma) so I can import the CSV file into PostgreSQL?
My Python code:
import json
import requests

# fmp_url, profile, apikey and ticker_str are defined earlier in the script
ticker_list = ['AAPL', 'MSFT', 'IBM', 'APD']
for ticker in ticker_list:
    url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
    # get data in json array format
    json_array = requests.get(url_profile).json()
    # for each record within the json array, use json.dumps to turn it into a json string.
    json_str = [json.dumps(element) for element in json_array]
    # add a ticker column and write both ticker and json string to a csv file:
    with open("C:\\DATA\\fmp_profile_no_brackets.csv", "a") as dest:
        for element in json_str:
            dest.writelines(ticker_str + ',' + element + '\n')
In PostgreSQL I have a table t_profile_json with 2 columns:
ticker varchar(20) and profile jsonb
When I copy the file fmp_profile into PostgreSQL using:
copy fmp.t_profile_json(ticker,profile) from 'C:\DATA\fmp_profile.csv' delimiter ',';
I have this error:
ERROR: extra data after last expected column
CONTEXT: COPY t_profile_json, line 1: "AAPL,{"symbol": "AAPL", "price": 144.49, "beta": 1.219468, "volAvg": 88657734, "mktCap": 22985613828..."
SQL state: 22P04
The COPY command seems to treat the whole "AAPL, json string…" part as one string.
I did something wrong in the dest.writelines(ticker_str + ',' + element + '\n') line, but I don't know how to correct it.
Thank you so much in advance for helping!
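The underlying problem is that the JSON string itself contains commas, so a plain comma-delimited COPY sees extra columns. A minimal sketch of one way around it, assuming the same ticker loop as above (fmp_url, profile and apikey are the values already defined in your script): let the csv module quote the JSON field, then load the file with PostgreSQL's CSV format so the quoting is honoured.

import csv
import json
import requests

def append_profiles(tickers, fmp_url, profile, apikey,
                    out_path="C:\\DATA\\fmp_profile.csv"):
    """Fetch each ticker's profile and append 'ticker,json' rows with proper CSV quoting."""
    with open(out_path, "a", newline="") as dest:
        writer = csv.writer(dest)  # QUOTE_MINIMAL: quotes any field containing a comma or quote
        for ticker in tickers:
            url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
            json_array = requests.get(url_profile).json()
            for element in json_array:
                # The JSON string is wrapped in quotes (with embedded quotes doubled),
                # so its commas no longer look like extra columns to COPY.
                writer.writerow([ticker, json.dumps(element)])

append_profiles(['AAPL', 'MSFT', 'IBM', 'APD'], fmp_url, profile, apikey)

# Then load it with the CSV format so PostgreSQL honours the quoting:
# COPY fmp.t_profile_json(ticker, profile) FROM 'C:\DATA\fmp_profile.csv' WITH (FORMAT csv);

The WITH (FORMAT csv) part matters: COPY's default text format does not interpret quotes, so quoting the field on the Python side alone would not be enough.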
I have a comma-separated file in .csv format:
name,address,zip
Ram,"123,ave st",1234
While moving the data to HDFS and creating a comma-separated Hive table, I am facing a column shift.
What properties in Hive will fix this issue? The columns currently come out as:
name - Ram
address - "123
zip - ave st"
Using the OpenCSVSerde in the table DDL handles quoted fields that contain the separator:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"SEPARATORCHAR" = ",",
"QUOTECHAR" = "\"",
"ESCAPECHAR" = "\""
)
STORED AS TEXTFILE
LOCATION
'hdfs://path'
It works.
I have 2 databases in MySQL:
1) An input Latitude-Longitude_dB ('latlong_db', henceforth): It has the latitude and longitude of each reading from a GPS tracking device.
2) A Weather_db: I read the input latlongs from the first database and calculate 'current' weather data for each pair of latlongs (e.g. humidity, cloud_coverage). This weather data is written into the Weather_db.
The issue is: I need to keep track of which record (which 'input latlong') was read last. This is so that I don't recalculate weather_data for the latlongs that I've already covered. How do I keep track of the last read input_latlong?
Thank you so much.
Edit:
1) For those who have been asking about the 'database vs. table' question: I am reading from one database and writing into the second database. The 'config.json' used to connect to the two databases is as follows:
{
    "Tracker_ds_locallatlongdb": {
        "database": "ds_testdb1",
        "host": "XXXXXXXXXXX",
        "port": XXXX,
        "user": "XXXX",
        "password": "XXXXX"
    },
    "Tracker_ds_localweatherdb": {
        "database": "ds_testdb2",
        "host": "XXXXXXX",
        "port": XXXX,
        "user": "XXXX",
        "password": "XXXXX"
    }
}
2) My Python script that reads from the input_latlong_db and writes into the weather_db is outlined as follows. I am using the OpenWeatherMap API to fetch weather data for the given latitudes and longitudes:
from pyowm import OWM
import json
import time
import pprint
import pandas as pd
import mysql.connector
from mysql.connector import Error
api_key = 'your api key'
def get_weather_data(my_lat, my_long):
    owm = OWM(api_key)
    obs = owm.weather_at_coords(my_lat.item(), my_long.item())  # Use: <numpy.ndarray>.item
    w = obs.get_weather()
    l = obs.get_location()
    city = l.get_name()
    cloud_coverage = w.get_clouds()
    .
    .
    .
    w_datatoinsert = [my_lat, my_long, w_latitude, w_longitude, city, weather_time_gmt, call_time_torontotime,
                      short_status, detailed_status,
                      temp_celsius, cloud_coverage, humidity, wind_deg, wind_speed,
                      snow, rain, atm_pressure, sea_level_pressure, sunset_time_gmt]  # 15 + act_latitude + act_longitude
    return w_datatoinsert
# ------------------------------------------------------------------------------------------------------------------------------------
spec_creds_1= {}
spec_creds_2= {}
def operation():
    with open('C:/Users/config.json') as config_file:
        creds_dict = json.load(config_file)
        spec_creds_1 = creds_dict['Tracker_ds_locallatlongdb']
        spec_creds_2 = creds_dict['Tracker_ds_localweatherdb']
    try:
        my_conn_1 = mysql.connector.connect(**spec_creds_1)
        if (my_conn_1.is_connected()):
            info_1 = my_conn_1.get_server_info()
            print("Connected ..now reading the local input_latlong_db: ", info_1)
        try:
            my_conn_2 = mysql.connector.connect(**spec_creds_2)
            if (my_conn_2.is_connected()):
                info_2 = my_conn_2.get_server_info()
                print('Connected to write into the local weather_db: ', info_2)
                cursor_2 = my_conn_2.cursor()
                readings_df = pd.read_sql("SELECT latitude, longitude FROM readings_table_19cols;", con=my_conn_1)
                for index, row in readings_df.iterrows():
                    gwd = get_weather_data(row['latitude'], row['longitude'])
                    q = "INSERT INTO weather_table_19cols VALUES(" + ",".join(["%s"] * len(gwd)) + " ); "
                    cursor_2.execute(q, gwd)
                    my_conn_2.commit()
        except Error as e:
            print("Error while connecting to write into the local weather_db: ", e)
        finally:
            if (my_conn_2.is_connected()):
                cursor_2.close()
                my_conn_2.close()
                print("Wrote 1 record to the local weather_db.")
    except Error as e:
        print("Error connecting to the local input latlong_db: ", e)
    finally:
        if (my_conn_1.is_connected()):
            my_conn_1.close()  # no cursor present for 'my_conn_1'
            print("Finished reading all the input latlongs ...and finished inserting ALL the weather data.")
#-------------------------------------------------------------------------------
if __name__ == "__main__":
    operation()
In the input latlong table (readings_table_19cols) I created an auto-incremented reading_id as the primary key, and a column called read_flag whose default value is 0.
In the weather_table_19cols I created an auto-incremented weather_id as the primary key.
Since my method involves reading an input latlong record and then writing its weather data into the weather table, I compared the index of readings_table_19cols with that of weather_table_19cols. If they match, the input record has been read, and I set read_flag to 1:
..
for index_1, row_1 in readings_df_1.iterrows():
    gwd = get_weather_data(row_1['imei'], row_1['reading_id'], row_1['send_time'], row_1['latitude'], row_1['longitude'])
    q_2 = "INSERT INTO weather_table_23cols(my_imei, my_reading_id, actual_latitude, actual_longitude, w_latitude, w_longitude, city, weather_time_gmt, OBD_send_time_gmt, call_time_torontotime, \
           short_status, detailed_status, \
           temp_celsius, cloud_coverage, humidity, wind_deg, wind_speed, \
           snow, rain, atm_pressure, sea_level_pressure, sunset_time_gmt) VALUES(" + ",".join(["%s"] * len(gwd)) + " ); "
    q_1b = "UPDATE ds_testdb1.readings_table_22cols re, ds_testdb2.weather_table_23cols we \
            SET re.read_flag = 1 WHERE (re.reading_id = we.weather_id);"  # use the prefix 'db_name.table_name' if 1 cursor is being used for 2 different db's
    cursor_2.execute(q_2, gwd)
    my_conn_2.commit()
    cursor_1.execute(q_1b)  # use Cursor_1 for 1st query
    my_conn_1.commit()
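A simpler way to keep track of what has already been read, sketched below on the assumption that readings_table_19cols really has the reading_id primary key and read_flag column described above (the connections, cursors and get_weather_data are the ones from the script): select only the unread rows, then flag each row by its own primary key right after its weather record is written, instead of comparing reading_id with weather_id.

import pandas as pd

def process_unread_latlongs(my_conn_1, my_conn_2, cursor_1, cursor_2):
    # Only fetch the input latlongs that have not been processed yet.
    readings_df = pd.read_sql(
        "SELECT reading_id, latitude, longitude FROM readings_table_19cols WHERE read_flag = 0;",
        con=my_conn_1)

    for _, row in readings_df.iterrows():
        gwd = get_weather_data(row['latitude'], row['longitude'])  # defined earlier in the script
        q_insert = "INSERT INTO weather_table_19cols VALUES(" + ",".join(["%s"] * len(gwd)) + ");"
        cursor_2.execute(q_insert, gwd)
        my_conn_2.commit()

        # Mark this exact input row as read, keyed by its own primary key.
        cursor_1.execute(
            "UPDATE readings_table_19cols SET read_flag = 1 WHERE reading_id = %s;",
            (int(row['reading_id']),))
        my_conn_1.commit()

With this shape, re-running the script simply resumes from whatever is still flagged 0, so no separate "last read" bookmark is needed.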
I have a large database that contains data about embedded devices in the field.
I've built a MySQL query that outputs data in this format, called "/tmp/data.csv":
Device_Serial_Number, Device_Location_1, Device_Location_2, Date_1, Date_2
"3782D822", "Springfield, MA", "123 Maple Street", "2016-05-02 13:43:00", "2016-05-05 03:22:44"
. . .
The output is thousands of lines long. Note that an individual Device_Serial_Number value can appear multiple times, each with a unique set of "Date_1", "Date_2" values.
What I need to do is create a separate .csv file for each value of "Device_Location_1". In each of those files, each unique "Device_Serial_Number" has only one row, and all of the "Date_1" and "Date_2" values associated with that "Device_Serial_Number" in the entire data set appear on that same row.
Example:
Device_Serial_Number, Device_Location_1, Device_Location_2, Date_1, Date_2, Date_1, Date_2, Date_1, Date_2
"3782D822", "Springfield, MA", "123 Maple Street", "2016-05-02 13:43:00", "2016-05-05 03:22:44", "2016-05-06 12:45:23", "2016-05-06 14:23:11", "2016-05-17 15:46:21", "2016-05-18 08:09:13"
To do this I'm trying to use AWK within a Bash script. I have used a second MySQL query to get a list of unique device serial numbers and saved the results as "/tmp/devList.csv". I am attempting to read each line of "devList.csv", append a string variable with the date strings from "data.csv" that match that device, and then use the Device_Serial_Number as the index of an associative array with the string of dates as the value.
Obviously this isn't working. I feel like this solution is way too complicated. Any help finding a working solution would be greatly appreciated.
awk -F, -v deviceList='/tmp/devList.csv' '
BEGIN {
    OFS = ","
    # read the list of serial numbers into the array, one per line
    while ((getline line < deviceList) > 0) { device[line] = "" }
}
{
    # NOTE: -F, also splits quoted fields such as "Springfield, MA",
    # so $4/$5 are only the dates when no earlier field contains a comma.
    dates = $4 "," $5 ","
    holder = device[$1]              # was devices[$1] -- typo, so it was always empty
    device[$1] = holder dates
}
END {
    for (i in device)
        if (device[i] != "")         # index with i, not $i
            print i "," device[i] > "/tmp/test_output.csv"
}' '/tmp/data.csv'
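If stepping outside AWK is an option, here is a short Python sketch of the same idea. It uses the csv module so the quoted, comma-containing fields parse correctly, builds one pivoted row per serial number, and writes one file per Device_Location_1. The /tmp/data.csv path comes from the question; the report file names are made up.

#!/usr/bin/env python3
import csv
from collections import OrderedDict, defaultdict

# Device_Serial_Number -> [serial, location_1, location_2, date_1, date_2, date_1, date_2, ...]
rows_by_serial = OrderedDict()

with open('/tmp/data.csv', newline='') as src:
    reader = csv.reader(src, skipinitialspace=True)
    next(reader)  # skip the header line
    for serial, loc1, loc2, date1, date2 in reader:
        if serial not in rows_by_serial:
            rows_by_serial[serial] = [serial, loc1, loc2]
        rows_by_serial[serial] += [date1, date2]

# Bucket the pivoted rows by Device_Location_1, then write one report per location.
rows_by_location = defaultdict(list)
for row in rows_by_serial.values():
    rows_by_location[row[1]].append(row)

for location, rows in rows_by_location.items():
    report = '/tmp/report_%s.csv' % location.replace(' ', '_').replace(',', '')
    with open(report, 'w', newline='') as dest:
        csv.writer(dest, quoting=csv.QUOTE_ALL).writerows(rows)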
I have a file whose content is shown below. I have only included a few records here, but there are around 1,000 records in a single file:
Record type : GR
address : 62.5.196
ID : 1926089329
time : Sun Aug 10 09:53:47 2014
Time zone : + 16200 seconds
address [1] : 61.5.196
PN ID : 412 1
---------- Container #1 (start) -------
inID : 101
---------- Container #1 (end) -------
timerecorded: Sun Aug 10 09:51:47 2014
Uplink data volume : 502838
Downlink data volume : 3133869
Change condition : Record closed
--------------------------------------------------------------------
Record type : GR
address : 61.5.196
ID : 1926089327
time : Sun Aug 10 09:53:47 2014
Time zone : + 16200 seconds
address [1] : 61.5.196
PN ID : 412 1
---------- Container #1 (start) -------
intID : 100
---------- Container #1 (end) -------
timerecorded: Sun Aug 10 09:55:47 2014
Uplink data volume : 502838
Downlink data volume : 3133869
Change condition : Record closed
--------------------------------------------------------------------
Record type : GR
address : 63.5.196
ID : 1926089328
time : Sun Aug 10 09:53:47 2014
Time zone : + 16200 seconds
address [1] : 61.5.196
PN ID : 412 1
---------- Container #1 (start) -------
intID : 100
---------- Container #1 (end) -------
timerecorded: Sun Aug 10 09:55:47 2014
Uplink data volume : 502838
Downlink data volume : 3133869
Change condition : Record closed
My goal is to convert this to a CSV or txt file like below:
Record type| address |ID | time | Time zone| address [1] | PN ID
GR |61.5.196 |1926089329 |Sun Aug 10 09:53:47 2014 |+ 16200 seconds |61.5.196 |412 1
Any guidance on how you think would be the best way to start this would be great. The sample I provided should give a clear idea, but in words: I want to read the header of each record once and put the record's data under the output header.
Thanks for your time and any help or suggestions.
What you're doing is creating an Extract/Transform script (the ET part of an ETL). I don't know which language you're intending to use, but essentially any language can be used. Personally, unless this is a massive file, I'd recommend Python as it's easy to grok and easy to write with the included csv module.
First, you need to understand the format thoroughly.
How are records separated?
How are fields separated?
Are there any fields that are optional?
If so, are the optional fields important, or do they need to be discarded?
Unfortunately, this is all headwork: there's no magical code solution to make this easier. Then, once you have figured out the format, you'll want to start writing code. This is essentially a series of data transformations:
Read the file.
Split it into records.
For each record, transform the fields into an appropriate data structure.
Serialize the data structure into the CSV.
If your file is larger than memory, this can become more complicated; instead of reading and then splitting, for example, you may want to read the file sequentially and create a Record object each time the record delimiter is detected. If your file is even larger, you might want to use a language with better multithreading capabilities to handle the transformation in parallel; but that is more advanced than you seem to need at the moment.
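To make those steps concrete, here is a minimal Python sketch, assuming the long dashed line is the record separator and every data line is a "name : value" pair as in the sample above. The seven columns and the pipe delimiter come from the desired output; the file names are placeholders.

import csv
import re

FIELDS = ["Record type", "address", "ID", "time", "Time zone", "address [1]", "PN ID"]

def parse_records(path):
    """Read the file, split it on the dashed separator lines, and yield one dict per record."""
    with open(path) as src:
        text = src.read()
    for chunk in re.split(r'^-{20,}\s*$', text, flags=re.MULTILINE):
        record = {}
        for line in chunk.splitlines():
            # keep only "name : value" lines, skipping the "--- Container ---" markers
            if ':' in line and not line.lstrip().startswith('---'):
                name, _, value = line.partition(':')
                record[name.strip()] = value.strip()
        if record:
            yield record

def write_csv(records, out_path):
    """Serialize the chosen fields into a pipe-delimited file, as in the desired output."""
    with open(out_path, 'w', newline='') as dest:
        writer = csv.writer(dest, delimiter='|')
        writer.writerow(FIELDS)
        for record in records:
            writer.writerow([record.get(field, '') for field in FIELDS])

write_csv(parse_records('records.txt'), 'records.csv')

Extra fields such as "timerecorded" or "Uplink data volume" are parsed but simply ignored here; adding them to FIELDS is enough to include them in the output.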
This is a simple PHP script that will read a text file containing your data and write a csv file with the results. If you are on a system which has command line PHP installed, just save it to a file in some directory, copy your data file next to it renaming it to "your_data_file.txt" and call "php whatever_you_named_the_script.php" on the command line from that directory.
<?php
$text = file_get_contents("your_data_file.txt");
$matches;
preg_match_all("/Record type[\s\v]*:[\s\v]*(.+?)address[\s\v]*:[\s\v]*(.+?)ID[\s\v]*:[\s\v]*(.+?)time[\s\v]*:[\s\v]*(.+?)Time zone[\s\v]*:[\s\v]*(.+?)address \[1\][\s\v]*:[\s\v]*(.+?)PN ID[\s\v]*:[\s\v]*(.+?)/su", $text, $matches, PREG_SET_ORDER);
$csv_file = fopen("your_csv_file.csv", "w");
if($csv_file) {
if(fputcsv($csv_file, array("Record type","address","ID","time","Time zone","address [1]","PN ID"), "|") === FALSE) {
echo "could not write headers to csv file\n";
}
foreach($matches as $match) {
$clean_values = array();
for($i=1;$i<8;$i++) {
$clean_values[] = trim($match[$i]);
}
if(fputcsv($csv_file, $clean_values, "|") === FALSE) {
echo "could not write data to csv file\n";
}
}
fclose($csv_file);
} else {
die("could not open csv file\n");
}
This script assumes that your data records are always formatted similar to the examples you have posted and that all values are always present. If the data file may have exceptions to those rules, the script probably has to be adapted accordingly. But it should give you an idea of how this can be done.
Update
Adapted the script to deal with the full format provided in the updated question. The regular expression now matches single data lines (extracting their values) as well as the record separator made up of dashes. The loop has changed a bit: it now fills a buffer array field by field until a record separator is encountered.
<?php
$text = file_get_contents("your_data_file.txt");
// this will match whole lines
// only if they either start with an alpha-num character
// or are completely made of dashes (record separator)
// it also extracts the values of data lines one by one
$regExp = '/(^\s*[a-zA-Z0-9][^:]*:(.*)$|^-+$)/m';
$matches;
preg_match_all($regExp, $text, $matches, PREG_SET_ORDER);
$csv_file = fopen("your_csv_file.csv", "w");
if($csv_file) {
// in case the number or order of fields changes, adapt this array as well
$column_headers = array(
"Record type",
"address",
"ID",
"time",
"Time zone",
"address [1]",
"PN ID",
"inID",
"timerecorded",
"Uplink data volume",
"Downlink data volume",
"Change condition"
);
if(fputcsv($csv_file, $column_headers, "|") === FALSE) {
echo "could not write headers to csv file\n";
}
$clean_values = array();
foreach($matches as $match) {
// first entry will contain the whole line
// remove surrounding whitespace
$whole_line = trim($match[0]);
if(strpos($whole_line, '-') !== 0) {
// this match starts with something else than -
// so it must be a data field, store the extracted value
$clean_values[] = trim($match[2]);
} else {
// this match is a record separator, write csv line and reset buffer
if(fputcsv($csv_file, $clean_values, "|") === FALSE) {
echo "could not write data to csv file\n";
}
$clean_values = array();
}
}
if(!empty($clean_values)) {
// there was no record separator at the end of the file
// write the last entry that is still in the buffer
if(fputcsv($csv_file, $clean_values, "|") === FALSE) {
echo "could not write data to csv file\n";
}
}
fclose($csv_file);
} else {
die("could not open csv file\n");
}
Doing the data extraction using regular expressions is one possible method mostly useful for simple data formats with a clear structure and no surprises. As syrion pointed out in his answer, things can get much more complicated. In that case you might need to write a more sophisticated script than this one.