Pattern match to identify date format - ssis

My source having different date formats as shown below, And im looking for an algorithm to identify the source date pattern tried in Pentaho Data integration with select value and Fuzzy steps.
Date Column (String)
"20150210"
"20050822--"
"2014-02-May"
"20051509--"
"02-May-2014"
"2013-May-12"
"12DEC2013"
"15050815"
"May-02-2014"
"12312015"
I know that in PDI we can achieve through JS step by writing If conditions for each pattern but is not a good idea and this approach makes transformation dead when dealing with huge records, looking out for efficient way to search date pattern.
I believe this is very common issue in all ETL projects, Here Im trying to understand how enterprise vendors like SAS Data Integration, Informatica, SSIS provides easy way to handle.
Do we have any Algorithm to identify source pattern. If so which one?
The formats that are listed above are not limited.

One cannot simply determine a "monovalent" value as the format for any given input.
Consider all of the following formats completely valid:
MM-dd-yy
dd-MM-yy
yy-MM-dd
As stated in a comment by #billinkc, what would you call 01-02-05 in that case?
If at all, your would be a solvable one only if you took a data set into account (e.g. you know that the next X rows are all from the same date format). Then you can look at it as a linear problem with some constraints that can help you determine the date format. Even then, you can't assure that you'll get a definite answer, just increase the probability that you'll have a definite answer.

Related

unable to import csv table DATEs columns to BigQuery

I am unable to import csv table > DATEs columns to BigQuery,
DATEs are not recognized, even they have correct format according this docu
https://cloud.google.com/bigquery/docs/schema-detect YYYY-MM-DD
So DATEs columns are not recognized and are renamed to _2020-0122, 2020-01-23...
Is the issue that DATES are in 1st row as column name ?
But How can I then import dates, when I want use them in TimeSeries Charts (DataStudio) ?
here is sample source csv>
Province/State,Country/Region,Lat,Long,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-026
Anhui,China,31.8257,117.2264,1,9,15,39,60
Beijing,China,40.1824,116.4142,14,22,36,41,68
Chongqing,China,30.0572,107.874,6,9,27,57,75
Here is ig from Bigquery
If you have finite number of days, you can try unpivot table when using it. See blog post.
otherwise, if you dont know how many day columns in csv file.
choose a unique character as csv delimiter then just load whole file into a single column staging table, then use split function. you'll also need unnest. This approach requires a full scan and will be more expensive, especially when file gets bigger.
The issue is that in column names you cannot have a date type, for this reason when the CSV is imported it takes the dates and transforms them to the format with underscores.
The first way to face the problem would be modifying the CSV file, because any import with the first row as a header will change the date format and then it will be harder to get to date type again. If you have any experience in any programming language you can do the transformation very easily. I can help doing this but I do not know your use case so maybe this is not possible. Where does this CSV come from?
If the CSV previous modification is not possible then the second option is what ktopcuoglu said, importing the whole file as one column and process this using SQL function. This is way harder than the first option and as you import all the data into a single column, all the data will have the same data type, what will be a headache too.
If you could explain where the CSV comes from we may be able to influence it before being ingested by BigQuery. Else, you'll need to deep into SQL a bit.
Hope it helps!
Hi, now I can help you further.
First I found some COVID datasets into the public bigquery datasets. The one you are taking from github is already in BigQuery, but there are many others that may work better for your task such as the one called “covid19_ecdc”, that is inside bigquery-public-data. This last one has the confirmed cases and deaths per date and country so it should be easy to make a time series.
Second, I found an interesting link performing what you meant with python and data studio. It’s a kaggle discussion so you may not be familiar with it, but it deserves a check for sure . Moreover, he is using the dataset you are trying to use.
Hope it helps. Do not hesitate to ask!

SSRS and Comparison Operators on Numeric Portion of varchar

Each returned transaction I am to report on is stored with a return reason code and a description of the return reason code. I built a tablix with two columns - one for return codes and another for descriptions. This works just peachy. The report owner is upset that a long list of codes will split pages - sigh. I was told to display them side-by-side.
I am new to t-sql and SSRS and its idiosyncrasies. I have minimal support from our DBAs. Two tables, filtered to display codes that meet a criteria sound simple enough.
My research:
MSDN's support network, Operators in Expressions page, and various help topics. I also found SO posts regarding split functions in t-sql and similar as well as one specifically asking about comparison and varchar. I found sites with helpful information like ResultData and Network Steve. I haven't found what I think I'm looking for.
My problem:
The return reason code is a varchar that always consists of the letter 'r' and two numeric digits (R00 to R99). It appears I can't run a comparison operator on an entire varchar that is alphanumeric; it doesn't recognize IIF((Fields!... <= R17),True,False). Additionally, the company will not allow the warehouse or its functions to be edited so I cannot create my own.
My solution ideas:
Add each Rnn code to the tablix filter, individually. This means ~50 filters per tablix and seems a sloppy or inefficient way of handling this
Separate the varchar string in to its alpha and numeric components and compare the latter using standard operators. This sounds the cleanest method but I'm unsure how to accomplish this in an expression or within SSRS
Forgo the two-table idea and create one table with four columns (code, description, code, description). This still leaves me with how to set a limit on the number of rows that can be created before 'spilling over' to the other side
I appreciate being pointed to any resources or any offered input to the issue and my (not so?)logical approach to it.
You can achieve your second option as follows:
CInt(Fields!ReturnCode.Value.Substring(1,2))

Is there any negatives to storing JSON in MySQL?

I am building a complex ordering system and I am struggling with whether I should store some of the more detailed information in a single column as JSON or if I should create the multiple tables and logic to keep JSON out of the picture.
Since each order will have multiple required dates, ship dates, parts, kits (collections of parts), and more. It just seems easier to store this as JSON of a single 'order'row.
Are there any major down sides to doing this?
JSON is geared more towards short term storage to send data from one thing to another. It is horribly inefficient space and computationally wise for long term storage compared to a database. You will also loose the ability to query the data directly without parsing it first (e.g "select * from table where orderdate < today"). You'll also have to develop your own tools to view the data, since if you try to view it in the database directly, everything will run together.
In short, this is almost always a really bad idea.

Formatting data for use with JSON and D3.js

I have the following data in my MySQL database. These three columns are a subset of a table that I have selected using a query.
Value Date Time
230.8 13/08/08 15:01:22+22
233.7 13/08/08 15:13:12+22
234.5 13/08/08 15:40:33+22
I want to represent this data on a graph of (Value) versus (Date & Time) in a chronological manner. What is the format I need to put the above data into before using JSON cause I've had a look at a few tutorials and when I apply the same logic (like this:http://www.d3noob.org/2013/02/using-mysql-database-as-source-of-data.html) I don't seem to be getting any graph at all.
Or will JSON and D3.js not work for my requirement? Do I need to look at something else? Like some other JavaScript?
Your question is a little bit vague, but I'll try to adress a few of your topics to help you get started.
Firstly, I would suggest finding the visualization that fits your needs. From the data subset that you showed in the question, I would suggest maybe this one. It is interesting because if you have multiple values for different times in a given day, you could construct various time series graphs and compare them interactively. There are other options, so you should explore and find a good starting point to improve and adapt to your needs.
Regarding the origin/format of the data, if you are able to extract that data you showed to a variable (with PHP, for example), you can then manipulate the data and build a structure from it. It doesn't necessarily have to be JSON and/or CSV. As long as you can handle it with d3.js's API functions. It isn't very difficult, but it is something that requires you to understand and read about the topic. First understand how to query for your needs with MySQL. Then, I would suggest starting here if you decide to go with JSON.
The example visualization I mentioned above uses a CSV file as a data source. Other option could be for instance to build a CSV file (or data structure - ie, an array) to feed into d3.js. There are various questions covering "how to create CSV with PHP", so you shouldn't have much difficulty finding the info you need.
Either way, after you feel confortable with what you know about these topics, start breaking your problem into smaller tasks and finding answers to one question at a time. If you need, post more questions here in SO and include your attempts at coding a solution, this will definitely get you all the help you might need.
in python it would look like this:
import json
output = json.dumps(['data', {'data_1': ('230.8', '13/08/08', '15:01:22+22')}, {'data_2': ('233.7', '13/08/08', '15:13:12+22')}, {'data_3': ('234.5', '13/08/08', '15:40:33+22')}])
print output
more information about python and json can be found here

MySQL - Convert Number to English Representation

Is there a particularly easy way to convert a number like "21.08" into "Twenty One and 08/100" using MySQL?
I know in Oracle there was a trick you could use that involved working with Julian dates. It would get the job done in a line or so but it doesn't appear to work in MySQL (since it doesn't support Julian dates).
It's not a particularly hard problem in a "real" programming language but the thought of writing it out as a stored procedure or function is dreadful.
Curious as to why you're doing this at the database layer instead of at the presentation layer...
If you really, really want to do this with MySQL, you could create two lookup tables called e.g. "ones" and "tens" that stored the English representation and then perform a query on each digit. Extract the digits by casting the number to a string and iterating backward from the decimal point, then performing a lookup in the appropriate table. Perhaps a third table could be used to supply strings like "Hundred", "Thousand", etc.
That's the most straightforward solution I can see, but it's going to be painful to write and probably quite brittle when it comes to internationalization. Also, it clutters the schema with lookup tables that don't have anything to do with your data.
Maybe writing a User-Defined Function (UDF) would be a better solution, though I imagine it will still be pretty time-consuming.