Spark: How to generate file paths to read from S3 with Scala - JSON

How do I generate and load multiple S3 file paths in Scala so that I can use:
sqlContext.read.json("s3://..../*/*/*")
I know I can use wildcards to read multiple files, but is there any way I can generate the paths? For example, my file structure looks like this:
BucketName/year/month/day/files
s3://testBucket/2016/10/16/part00000
These files are all JSON. The issue is that I need to load files for just a specific duration, e.g. 16 days: for a start day of Oct 16, I need to load files from Oct 1 to Oct 16.
With a 28-day duration and the same start day, I would like to read from Sep 18.
Can someone tell me how to do this?

You can take a look at this answer. You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Or you can use the AWS API to get the list of file locations and read those files using Spark.
You can look into this answer on AWS S3 file search.
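For the AWS API route, a minimal sketch using the AWS SDK for Java (v1) might look like the following. The bucket name and prefix are placeholders, listObjectsV2 returns at most 1000 keys per call (so larger listings need pagination), and the varargs json call assumes a Spark 2.x DataFrameReader:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.defaultClient()
// List every key under the placeholder prefix (paginate if there are more than 1000 keys)
val keys = s3.listObjectsV2("testBucket", "2016/10/").getObjectSummaries.asScala.map(_.getKey)
// Turn the keys into full S3 paths and hand them all to Spark
val paths = keys.map(key => s"s3://testBucket/$key")
val df = sqlContext.read.json(paths: _*)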

You can generate a comma-separated path list:
sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...")
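A rough Scala sketch of generating that list for a date window, assuming the s3://bucket/yyyy/MM/dd/ layout from the question (the dayPaths helper name is made up for illustration):
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Build one path per day, counting back `days` days (inclusive) from `startDay`
def dayPaths(bucket: String, startDay: LocalDate, days: Int): Seq[String] = {
  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")
  (0 until days).map(i => s"s3://$bucket/${startDay.minusDays(i).format(fmt)}/")
}

// 16-day window ending Oct 16 -> s3://testBucket/2016/10/16/, ..., s3://testBucket/2016/10/01/
val paths = dayPaths("testBucket", LocalDate.of(2016, 10, 16), 16)
// Pass the comma-separated list exactly as suggested above
val df = sqlContext.read.json(paths.mkString(","))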

Related

file "(...).csv" not Stata file error when using merge command

I use Stata 12.
I want to add some country code identifiers from file df_all_cities.csv onto my working data.
However, this line of code:
merge 1:1 city country using "df_all_cities.csv", nogen keep(1 3)
gives me the error:
. run "/var/folders/jg/k6r503pd64bf15kcf394w5mr0000gn/T//SD44694.000000"
file df_all_cities.csv not Stata format
r(610);
This was an attempted solution to my previous problem of the file being a .dta file that didn't work in this version of Stata, so I used R to convert it to .csv, but that doesn't work either. I assume it's because the "using" part itself doesn't work with .csv files, but how should I write it instead?
Your intuition is right. The command merge cannot read a .csv file directly. (using is technically not a command here, it is a common syntax tag indicating a file path follows.)
You need to read the .csv file with the command insheet. You can use it like this:
* Preserve saves a snapshot of your data which is brought back at "restore"
preserve
* Read the csv file. clear can safely be used as data is preserved
insheet using "df_all_cities.csv", clear
* Create a tempfile where the data can be saved in .dta format
tempfile country_codes
save `country_codes'
* Bring back into working memory the snapshot saved at "preserve"
restore
* Merge your country codes from the tempfile to the data now back in working memory
merge 1:1 city country using `country_codes', nogen keep(1 3)
Note how insheet also uses using, and that this command accepts .csv files.

Converting parquet files in S3 to CSV and storing them back in S3

Information:
I have parquet files stored in S3 which I need to convert into CSV and store back into S3.
The way I have the parquet files structured in S3 is as follows:
2019
2020
  |- 01
  ...
  |- 12
      |- 01
      ...
      |- 29
          |- part-0000.snappy.parquet
          |- part-0001.snappy.parquet
          ...
          |- part-1000.snappy.parquet
...
Requirements for the solution:
Any AWS tooling (needs to use Lambda; no EC2 or ECS) (open to suggestions, though)
That the CSV files keep their headers during conversion (if they are split up)
That the CSVs retain their original information and have no added columns/information
That the converted CSV files remain around 50-100 MB
The solutions I have already tried:
"entire folder method"
Using Athena CREATE EXTERNAL TABLE -> CREATE TABLE AS on the entire data folder (e.g. s3://2020/06/01/)
fig: #1
CREATE EXTERNAL TABLE IF NOT EXISTS database.table_name (
  value_0 bigint,
  value_1 string,
  value_2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'serialization.format' = '1' )
LOCATION 's3://2020/06/01' TBLPROPERTIES ('has_encrypted_data'='false')
fig: #2
CREATE TABLE database.different_table_name
WITH ( format='TEXTFILE', field_delimiter=',', external_location='s3://2020/06/01-output') AS
SELECT * FROM database.table_name
Doing this "entire folder method" works for converting parquet to CSV, but it leaves the CSV files at around 1 GB+ in size, which is way too large. I tried creating a solution to split up the CSV files (thanks to help from this guide), but it failed since Lambda has a 15-minute limit and memory constraints, which made it difficult to split all these 1 GB+ CSV files into roughly 50-100 MB files.
"single file method"
Using the same CREATE EXTERNAL TABLE (see fig: #1) and
fig: #3
CREATE TABLE database.different_table_name
WITH ( format = 'TEXTFILE', field_delimiter=',', external_location = 's3://2020/06/01-output') AS
SELECT *, "$path" FROM database.table_name
WHERE "$path" LIKE 's3://2020/06/01/part-0000.snappy.parquet';
Doing this "single file method" required me to integrate AWS SQS to listen to S3 events for objects created in the bucket, filtering for .snappy.parquet. This solution converted the parquet to CSV and created CSVs which fit the size requirements. The only issue is that the CSVs were missing headers and had additional fields which never existed in the parquet in the first place, such as the full S3 path of the source file.
You can use dask:
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket_path/*.parquet')
# converting dask df to pandas df
df = df.compute()
df.to_csv('out.csv')
While there's no way to configure the output file sizes, you can control the number of output files in each output partition when using CTAS in Athena. The key is to use the bucket_count and bucketed_by configuration parameters, as described here: How can I set the number or size of files when I run a CTAS query in Athena?. Run a few conversions and record the sizes of the Parquet and CSV files, and use that as a heuristic for how many buckets to configure for each job; each bucket will become one file.
When working with Athena from Lambda you can use Step Functions to avoid the need for the Lambda function to run while Athena is executing. Use the Poll for Job Status tutorial as a starting point. It's especially useful when running CTAS jobs since these tend to take longer to run.

NiFi merge CSV files using MergeRecord

I have a stream of JSON records that I successfully convert into CSV records with these instructions, but now I want to merge these CSV records into one CSV file. Below is that flow:
At step 5 I end up with around 9K CSV records; how do I merge them into one CSV file using the MergeRecord processor?
My CSV header:
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11
Some of these fields may be null and vary between records.
After this, use UpdateAttribute and configure it so that it saves the file with a filename, and after that use PutFile to store it in a specific location.
I had a similar problem and solved it by using the RouteOnAttribute processor. Hope this helps someone.
Below is how I configured the processor, using ${merge.count:equals(1)}

Load csv file with integers in Octave 3.2.4 under Windows

I am trying to import into Octave a file (data.txt) containing 2 columns of integers, such as:
101448,1077
96906,924
105704,1017
I use the following command:
data = load('data.txt')
However, the "data" matrix that results has a 1 x 1 dimension, with all the content of the data.txt file saved in just one cell. If I adjust the numbers to look like floats:
101448.0,1077.0
96906.0,924.0
105704.0,1017.0
the loading works as expected, and I obtain a matrix with 3 rows and 2 columns.
I looked at the various options that can be set for the load command but none of them seem to help. The data file has no headers, just plain integers, comma separated.
Any suggestions on how to load this type of data? How can I force Octave to cast the data as numeric?
The load function is not meant to read csv files. It is meant to load files saved from Octave itself, which define variables.
To read a csv file, use csvread("data.txt"). Also, 3.2.4 is a very old version that is no longer supported; you should upgrade.

How to read a file where one column's data is present in another column using Talend Data Integration

I get data in CSV format daily.
Example data looks like:
Emp_ID emp_leave_id EMP_LEAVE_reason Emp_LEAVE_Status Emp_lev_apprv_cnt
E121 E121- 21 Head ache, fever, stomach-ache Approved 16
E139 E139_ 5 Attending a marraige of my cousin Approved 03
Here you can see that emp_leave_id and EMP_LEAVE_reason column data is shifted/scattered into the next columns.
The problem is that, using tFileInputDelimited and various reading patterns, I couldn't load the data correctly into my target database. Mainly, I'm not able to read the data correctly with that component in Talend.
Is there a way that I can properly parse this CSV to get my data in the format that I want?
This is probably a TSV file. Not sure about Talend, but uniVocity can parse these files for you:
TsvDataStoreConfiguration tsv = new TsvDataStoreConfiguration("my_TSV_datastore");
tsv.setLimitOfRowsLoadedInMemory(10000);
tsv.addEntities("/some/dir/with/your_files", "ISO-8859-1"); //all files in the given directory path will be accessible entities.
JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("my_Database", myDataSource);
database.setLimitOfRowsLoadedInMemory(10000);
Univocity.registerEngine(new EngineConfiguration("My_ETL_Engine", tsv, database));
DataIntegrationEngine engine = Univocity.getEngine("My_ETL_Engine");
DataStoreMapping dataStoreMapping = engine.map("my_TSV_datastore", "my_Database");
EntityMapping entityMapping = dataStoreMapping.map("your_TSV_filename", "some_database_table");
entityMapping.identity().associate("Emp_ID", "emp_leave_id").toGeneratedId("pk_leave"); //assumes your database does not keep the original ids.
entityMapping.value().copy("EMP_LEAVE_reason", "Emp_LEAVE_Status").to("reason", "status"); //just copies whatever you need
engine.executeCycle(); //executes the mapping.
Do not use a CSV parser to parse TSV inputs. It won't handle escape sequences properly (such as \t inside a value: you will get the escape sequence instead of a tab character), and it will surely break if your value has a quote in it (a CSV parser will try to find the closing quote character and will keep reading characters until it finds another quote).
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).