Assume millions of lines of traffic data in SQL format.
For each row in a given range, I want to extract from the URL column the substring that matches a target tag.
For example, from the column URL, I have the following texts:
Column: `URL`
Row 1: http://www.google.com/abcdeft?&QQ=123&AA=america&YY=111
Row 2: http://www.google.com/abcdeft?&QQ=123&AA=asia&YY=111
Row 3: http://www.google.com/abcdeft?&QQ=123&AA=africa&YY=111
Row 4: http://www.google.com/abcdeft?&QQ=123&AA=south&YY=111
Row 5: http://www.google.com/abcdeft?&QQ=123&AA=south&YY=111
Row 6: http://www.google.com/abcdeft?&QQ=123&AA=&YY=111
Row 7: http://www.google.com/abcdeft?&QQ=123
...
Row 99999999: http://www.google.com/abcdeft?&QQ=123&AA=ddd&YY=111
Data keeps being loaded, with lots of updates, so performance matters. My goal is to:
Identify each row by its key tag &AA=. Basically, I need to get the string in the tag &AA= from every single row. For example, I want africa from &AA=africa&, and None if there is no &AA=, but I still need to read every single row.
Identify duplicate rows that contain the same tag in &AA=. e.g. row 4 and 5 are duplicates because they have same AA tags of south.
Question: which would be the best way for future data processing?
Option 1. Without a new column (parse the existing URL column)
Read every single row in URL column
Parse each row for the tag &AA= using urlparse library
Need a separate script to find duplicate rows with the same AA tag, e.g. using Python, I need to make a list of all items (all tags) and find the duplicate items in the list.
Need a separate query to find the rows that contain duplicate tags. e.g. query the rows that contain the duplicate items in the column URL
Creating separate column specifically for this task seems relatively doable.
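Option 1 could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names extract_aa and duplicate_rows are made up here, the URLs are passed in as a plain list rather than read from the database, and an empty &AA= is treated the same as a missing one.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def extract_aa(url):
    """Return the value of the AA query parameter, or None if absent or empty."""
    query = urlsplit(url).query
    # parse_qs drops blank values by default, so "AA=" yields no entry.
    values = parse_qs(query).get("AA")
    return values[0] if values else None


def duplicate_rows(urls):
    """Map each AA tag that occurs more than once to its 1-based row numbers."""
    rows_by_tag = defaultdict(list)
    for row_number, url in enumerate(urls, start=1):
        tag = extract_aa(url)
        if tag is not None:
            rows_by_tag[tag].append(row_number)
    return {tag: rows for tag, rows in rows_by_tag.items() if len(rows) > 1}
```

Every row still has to be read and parsed each time the check runs, which is the cost Option 2 tries to avoid.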
Option 2. Insert another new column AA for tag &AA= and start filling out the new column when updating traffic data.
In this way:
No need to read the column URL
No need to parse the text in URL to get the tag &AA=
No need to find duplicate items from one query
No need to retrieve rows with duplicate items from another query
In this way, we can easily:
Get &AA= data just selecting the column AA
SELECT duplicate rows using COUNT function in SQL
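A minimal sketch of Option 2, using Python's sqlite3 module with an in-memory database purely for illustration. The table name traffic, the column names, the index name, and the helper aa_of are assumptions, not part of the original setup; the real database engine and schema will differ.

```python
import sqlite3
from urllib.parse import parse_qs, urlsplit


def aa_of(url):
    """Return the AA query value, or None when it is missing or empty."""
    values = parse_qs(urlsplit(url).query).get("AA")
    return values[0] if values else None


urls = [
    "http://www.google.com/abcdeft?&QQ=123&AA=america&YY=111",
    "http://www.google.com/abcdeft?&QQ=123&AA=asia&YY=111",
    "http://www.google.com/abcdeft?&QQ=123&AA=africa&YY=111",
    "http://www.google.com/abcdeft?&QQ=123&AA=south&YY=111",
    "http://www.google.com/abcdeft?&QQ=123&AA=south&YY=111",
    "http://www.google.com/abcdeft?&QQ=123&AA=&YY=111",
    "http://www.google.com/abcdeft?&QQ=123",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (id INTEGER PRIMARY KEY, url TEXT, aa TEXT)")
# An index on the new column lets the grouping query avoid touching url at all.
conn.execute("CREATE INDEX idx_traffic_aa ON traffic (aa)")

# The tag is parsed once, at load time, and stored next to the URL.
conn.executemany(
    "INSERT INTO traffic (url, aa) VALUES (?, ?)",
    [(url, aa_of(url)) for url in urls],
)

# Tags that occur in more than one row.
duplicate_tags = conn.execute(
    "SELECT aa, COUNT(*) FROM traffic "
    "WHERE aa IS NOT NULL GROUP BY aa HAVING COUNT(*) > 1"
).fetchall()

# The rows that carry one of those tags.
duplicate_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM traffic WHERE aa IN "
        "(SELECT aa FROM traffic GROUP BY aa HAVING COUNT(*) > 1) ORDER BY id"
    )
]
```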
Which one would perform better?
If you can stand the extra space cost of having an additional column, then that would be the optimal approach. If there are a lot of duplicates of AA, you might consider putting the values in another table and then joining to it for queries. That would cut down on the space cost and still give you all the flexibility. It would also be easier (and faster to query) if you were querying on an ID instead of the textual value of AA.
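The lookup-table variant might look something like the sketch below, again using sqlite3 only as a stand-in. The table names aa_tag and traffic, the column names, and the helper tag_id are hypothetical, and the tags are supplied directly rather than parsed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE aa_tag (id INTEGER PRIMARY KEY, tag TEXT UNIQUE NOT NULL);
    CREATE TABLE traffic (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        aa_id INTEGER REFERENCES aa_tag (id)
    );
    CREATE INDEX idx_traffic_aa_id ON traffic (aa_id);
    """
)


def tag_id(tag):
    """Return the id for a tag, creating the lookup row on first use."""
    if tag is None:
        return None
    conn.execute("INSERT OR IGNORE INTO aa_tag (tag) VALUES (?)", (tag,))
    return conn.execute("SELECT id FROM aa_tag WHERE tag = ?", (tag,)).fetchone()[0]


sample = [
    ("http://www.google.com/abcdeft?&QQ=123&AA=america&YY=111", "america"),
    ("http://www.google.com/abcdeft?&QQ=123&AA=south&YY=111", "south"),
    ("http://www.google.com/abcdeft?&QQ=123&AA=south&YY=111", "south"),
    ("http://www.google.com/abcdeft?&QQ=123", None),
]
for url, tag in sample:
    conn.execute(
        "INSERT INTO traffic (url, aa_id) VALUES (?, ?)", (url, tag_id(tag))
    )

# Each distinct tag is stored once; duplicates are found by grouping on the id.
duplicates = conn.execute(
    "SELECT t.tag, COUNT(*) FROM traffic AS r "
    "JOIN aa_tag AS t ON t.id = r.aa_id "
    "GROUP BY t.id, t.tag HAVING COUNT(*) > 1"
).fetchall()
```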
I am trying to write a query that goes through a whole table and counts the cells that contain e.g. "1/23/20", but I have only found ways to search one specific column. Is it possible to search all columns?
This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.
The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes represents the key and the remainder of each row represents the rest of the record's data. There are records in which the same key is repeated though the remaining data is different in each record. The records are already sorted on 'KEYn'.
KEY1aaaaaa
KEY1bbbbbb
KEY2cccccc
KEY3xxxxxx
KEY3yyyyyy
KEY3zzzzzz
KEY3wwwwww
KEY4uuuuuu
KEY5hhhhhh
KEY5ffffff
My requirement is to pick up the first record of each key and drop the remaining 'duplicates', so the output file for the above input should look like this:
KEY1aaaaaa
KEY2cccccc
KEY3xxxxxx
KEY4uuuuuu
KEY5hhhhhh
Since the data is already sorted, I don't want to use SORT utility with SUM FIELDS=NONE or ICETOOL with SELECT - FIRST operand since both of these will actually end up re-sorting the data on the deduplication key (KEYn). Also the actual dataset I am referring to is huge (1.6 billion records, AVGRLEN 900 VB) and a job actually ran out of sort work space trying to sort it in one go.
My query is: Is there any option available in JCL based utilities to do this deduplication without resorting and using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.
Try this (untested):
OPTION COPY
INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)
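For readers who do not know the sort control statements, the idea behind them (number the records within each key as they are copied, then keep only record number 1) is a single pass over already-sorted input. The Python sketch below only illustrates that logic; it is not a replacement for the JCL, and the function name first_per_key and the in-memory list are assumptions made for the example.

```python
from itertools import groupby


def first_per_key(records, key_length=4):
    """Yield the first record of each run of equal keys.

    Relies on the input already being sorted (or at least grouped) by key,
    so no sorting and no extra work space are needed.
    """
    for _key, run in groupby(records, key=lambda record: record[:key_length]):
        yield next(run)


records = [
    "KEY1aaaaaa", "KEY1bbbbbb", "KEY2cccccc", "KEY3xxxxxx", "KEY3yyyyyy",
    "KEY3zzzzzz", "KEY3wwwwww", "KEY4uuuuuu", "KEY5hhhhhh", "KEY5ffffff",
]
deduplicated = list(first_per_key(records))
```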
I have a dataset that I display in an SSRS tablix with Name, Radio #, and Unit # fields. As some of the groups have 60+ members, I thought it would be better to spread the table across 4 column groups repeating those detail fields instead of displaying a skinny table three pages long. In the SQL I used a row count % 4 expression to assign a "position" number 0-3 to each name. If I create a table with the detail fields above and then add a parent column group on position, the fields are repeated as I want, but each name/radio/unit appears on its own row. I've tried several different ways of grouping rows/columns but always seem to get this staggered table (shown with only name/radio to make it easier to digest): sample_pic
Sorry if this is a duplicate. I've really searched quite a bit before posting this, but it's probably the case that if I knew what to search for, I wouldn't be asking. So if you'd rather tell me what to search for, I can do that too. :)
SSRS will display a row in the table for each row returned from the dataset; this is normal behaviour.
One way to get what you want is to create a query which has all the information from your column headings in one row, probably with a pivot or similar.
Or you could just display your columns in separate tables.
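To illustrate the first suggestion: the position number (row count % 4) says which column a member belongs in, but something also has to say which output row it belongs in, and integer division of the same row count gives that. The Python sketch below shows only the reshaping idea; the function name to_grid and the sample values are made up, and in the report itself the same two numbers would come from the query.

```python
def to_grid(members, columns=4):
    """Arrange a flat list into rows of `columns` items, filling left to right."""
    rows = []
    for index, member in enumerate(members):
        # row_number groups the output rows; position picks the column (0-3).
        row_number, position = divmod(index, columns)
        if position == 0:
            rows.append([])
        rows[row_number].append(member)
    return rows


grid = to_grid(["A", "B", "C", "D", "E", "F"])
```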
I am a newbie with Access and I am trying to import records into several tables from an Excel file. Each row in Excel has a different number of columns, but the good thing is that column A lets me identify which records need to go to which table.
Sample table
As you can see in the picture, Row 1 Column A has the value of "H", which would indicate that this record needs to go to the "H" table. Then the next few rows have a value of "R" in Column A which indicates that these records should go to the "R" table, and so on and so forth. However, the number of records to be imported into each table will vary all the time. Like the sample above rows 2 through 10 belong to the table R, but the next import may have only 5 or 20 records.
Currently I am using a temporary table and using an append query for each table but I am wondering if there is an easier way via VBA or other method that could be faster and more efficient.
Thanks!
The way you are doing it now may be the best way. An alternative would be to do this in two steps:
1) split your column A, and parse out to different sheets (or different workbooks).
http://www.rondebruin.nl/win/s3/win006.htm
2) load those different sheets (or workbooks) into different tables.
http://www.accessmvp.com/KDSnell/EXCEL_Import.htm#ImpAllWktsSepTbl
http://www.accessmvp.com/KDSnell/EXCEL_Import.htm#ImpFldWrkFiles
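The splitting step amounts to routing each row by the value in its first column. The linked pages show how to do this in VBA; the Python sketch below only illustrates the routing logic, with a hypothetical function name and made-up sample rows.

```python
from collections import defaultdict


def split_by_record_type(rows):
    """Group rows by their first cell; the remaining cells are the record."""
    tables = defaultdict(list)
    for row in rows:
        if not row:
            continue  # skip completely empty rows
        record_type, *fields = row
        tables[record_type].append(fields)
    return dict(tables)


# Hypothetical rows: column A holds the target table, the rest vary in length.
rows = [
    ["H", "header value", "batch 7"],
    ["R", "first record", 1],
    ["R", "second record", 2],
    ["T", 2],
]
tables = split_by_record_type(rows)
```

Because rows are grouped by their marker rather than by position, it does not matter how many "R" rows a given import contains.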
I am trying to create a dynamic table. I have tried a Pivot Table, but cannot get it to work. So I thought that maybe it could be done with an IF statement, but that did not work for me either.
Basically, I have 2 tables, 1 table containing the information (data source table) and 1 table that should be dynamic according to the data in the first table.
So if I change the data in the E-column, the Fruit table (image below) must be updated accordingly.
So if I write 2 instead of 1 in the count of Apples, then it should create 2 apples under the "Fruit" column. Data in the remaining columns will be calculated with a formula or fixed data, so that is not important.
I am open to any solutions; formulas, pivot tables, VBA, etc.
Have a nice weekend.
I have both Excel 2010 and 2013.
If you want to repeat some text a number of times you can use a somewhat complicated formula to do it. It relies on there not being duplicate entries in the Fruits table and no entries with 0 count.
Picture of ranges and results
Formulas involved include a starter cell E2 and a repeating entry E3 and copied down. These are actually normal formulas, no array required. Note that I have created a Table for the data which allows me to use named fields to get the whole column.
E2 = INDEX(Table1[Fruits],1)
E3 = IF(
INDEX(Table1[Count],MATCH(E2,Table1[Fruits],0))>COUNTIF($E$2:E2,E2),
E2,
INDEX(Table1[Fruits],MATCH(E2,Table1[Fruits],0)+1))
How it works
This formula relies on checking the number of entries above the current one and comparing to the desired count. Some notes:
The starter cell is needed to get the first result.
After the first cell, it counts how often the value directly above already appears in the list so far and compares that to the desired count. If the value has appeared fewer times than desired, the formula repeats it. Otherwise, it moves on to the next item in the list. The range $E$2:E2 mixes an absolute and a relative reference so that it always covers the cells above.
Since it goes to the next item in the list, don't put a 0 for a count or it will get included once.
You can copy this down for as many cells as you want. It will return #REF! when it runs out of data. You can wrap it in IFERROR(..., "") to display blanks instead.
If the non-0 rule is too much, it can probably be removed with a little effort. If there are duplicates, that will be much harder to deal with.
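The cell-by-cell logic of the formula can be mirrored in Python to see why the two restrictions exist. This is just a sketch of the same rule, not something to paste into Excel; the function name expand and the sample data are made up.

```python
def expand(fruits, counts, max_cells):
    """Mimic the formula: each cell looks only at the cells above it."""
    cells = [fruits[0]]  # the starter cell (E2)
    while len(cells) < max_cells:
        above = cells[-1]
        position = fruits.index(above)  # MATCH(..., 0): first match only
        if counts[position] > cells.count(above):  # COUNTIF over cells above
            cells.append(above)  # not enough copies yet, repeat
        elif position + 1 < len(fruits):
            cells.append(fruits[position + 1])  # move to the next item
        else:
            break  # Excel would show #REF! from here on
    return cells


normal = expand(["Apple", "Banana", "Cherry"], [2, 1, 3], 10)
with_zero = expand(["Apple", "Banana"], [1, 0], 5)
```

In the second call the item with a count of 0 still shows up once, because the rule moves to the next item before it ever checks that item's count. Duplicate names would break the first-match lookup in the same way they break MATCH.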