Does "grepl" help to extract a vector of strings in R - grepl

Here is my data.frame, named "bird":
ID  Species    Habitat
----------------------------------------------
1   Sunbird    Bamboo
2   Sunbird    Alpine
3   Sunbird    Meadow
4   Sparrow    Wetland, Bamboo
5   Whydah     Bamboo
6   Sparrow    Alpine, Wetland, Savannah, Bamboo
7   Firefinch  Savannah, Bamboo
Under bird$Habitat, I only need to retain "Bamboo" and "Alpine" and get rid of "Wetland", "Savannah", and "Meadow".
I used the following syntax:
dat1 <- with(bird, bird[grepl('\\b(Bamboo|Alpine)\\b', bird$Habitat), ])
However, all rows that contain "Bamboo" or "Alpine" (e.g. 4, 6 and 7) come back with the unwanted names still in them (e.g. "Wetland", "Savannah", "Meadow").
Any idea how I can retain only "Bamboo" and "Alpine" in those rows (e.g. 4, 6 and 7)?
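One base-R way to do both steps, filtering the rows and stripping the unwanted habitat names, is to extract only the matching words with gregexpr()/regmatches() and rebuild the column. A minimal sketch, assuming the bird data frame shown above:

keep    <- c("Bamboo", "Alpine")
pattern <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")

# Extract every matching habitat per row, then collapse back to one string.
matches      <- regmatches(bird$Habitat, gregexpr(pattern, bird$Habitat))
bird$Habitat <- sapply(matches, paste, collapse = ", ")

# Keep only the rows that contained at least one of the wanted habitats.
dat1 <- bird[bird$Habitat != "", ]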

Related

I'm trying to de-duplicate an output CSV column by column, then stack the results in a new CSV and de-duplicate them again

I've been working with Wireshark for a while now, and I've written scripts using tshark to pull specific types of data out and organize it into CSV files on Linux. Now I'm trying to pull out all of the ports in a given day's collect and show only the unique ports used that day, regardless of whether they were source or destination ports. An example of what I'm trying to have as a result:
Source  Destination  Output
------------------------------
1       2            1
14      4            2
22      2            14
14      6            4
1       4            22
22      8            6
                     8
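For the union itself, one simple pipeline (a sketch, assuming the two ports sit in the first two comma-separated columns of ports.csv; adjust the -f field numbers to the real tshark output) is:

cut -d',' -f1,2 ports.csv | tr ',' '\n' | awk '!seen[$0]++' > unique_ports.csv

The awk '!seen[$0]++' filter keeps only the first occurrence of each port, which also preserves first-seen order, as in the Output column above.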

MySQL for statistical results

I have a big task that I need some help with.
My goal is the following table structure:
type: type of car
number_of_cars: number of cars for each car type
number: total number of people driving each car type
average: average number of people driving each car type
median: median number of people driving each car type
max: max value of people driving each car type
min: min value of people driving each car type
standard deviation: standard deviation of number of people driving each car type
My data table looks like the following:
id type people
-----------------------
1 subaru 1
2 bmw 5
3 tesla 2
4 tesla 3
5 subaru 4
6 tesla 1
7 tesla 3
8 subaru 1
9 bmw 5
10 subaru 7
11 subaru 7
12 ford 2
13 ford 4
14 subaru 6
15 ford 3
16 tesla 2
17 tesla 1
18 tesla 1
19 tesla 1
Where id is a unique identifier, type is the type of car, and people is the number of people driving this car.
How do I create one giant MySQL query that gives me the results I need for my table?
Help is appreciated!
Ps. I know that MySQL is not necessarily the best approach to gather statistical data like this, but it should be possible, right?
All the statistical data you want can be gathered using a GROUP BY clause and MySQL's built-in aggregate functions; simply read up on aggregate functions in MySQL.
The only thing you won't find there is the median. MySQL doesn't have a built-in function for it, but you can easily find a workaround by googling it.
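For everything except the median, it is a single GROUP BY; a sketch, assuming the data table is named cars (the table name is an assumption):

-- Aggregate per car type; the median still needs a separate workaround.
SELECT
    type,
    COUNT(*)       AS number_of_cars,
    SUM(people)    AS number,
    AVG(people)    AS average,
    MAX(people)    AS max_people,
    MIN(people)    AS min_people,
    STDDEV(people) AS standard_deviation
FROM cars
GROUP BY type;

For the median you would add a separate step, e.g. ordering each type's rows by people and picking the middle value(s).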

How to create a query based on condition with 3 joined tables in MySQL

I'm a bit confused when trying to create a specific query with the following data tables:
**table 1 - referral_data:**
ID attribution_name
------------------------
1 Category
2 Brand
3 Size
4 Color
5 Processor
6 OS
7 Screen Size
.....
**table 2 - referral_values:**
ID ref_data_id attribution_value
---------------------------------------------
1 1 Cell Phones
2 1 Tablets
3 1 Laptops
4 1 Computers
5 1 LCD Monitors
6 2 Nokia
7 2 Motorola
8 2 Samsung
9 2 Lenovo
10 2 Philips
11 3 10x10x11
12 3 100x100x20
13 3 10x200x200
14 3 2x2x3
15 4 Black
16 4 Cyan
17 4 Magenta
18 4 White
19 4 Blue
20 5 ARM Cortex A11
21 5 Snapdragon 11
22 5 Intel I3 XXXXX
23 5 Exynos XXXX
24 6 Android 4.1
25 6 Android 3.0
26 6 Windows Phone 3
27 6 Windows 8 Professional
28 7 18.5"
29 7 11.8"
30 7 7.0"
31 7 5.0"
32 7 3.5"
......
**table 3 - product_specs:**
ID product_id referral_data_id referral_value_id
--------------------------------------------------------
1 1050 1 1 // <-- Product 1 - Category: Cell Phones
1 1050 2 8 // <-- Product 1 - Brand: Samsung
1 1050 4 19 // <-- Product 1 - Color: Blue
1 1050 6 24 // <-- Product 1 - Processor: Exynos XXXX
1 1050 7 30 // <-- Product 1 - Screen Size: 7.0"
1 1068 1 4 // <-- Product 2 - Category: Computers
1 1068 2 9 // <-- Product 2 - Brand: Samsung
1 1068 6 22 // <-- Product 2 - Processor: Intel Core I3
1 1068 7 28 // <-- Product 2 - Screen Size: 18.5"
......
These tables make up a "Product Catalog" that I'm planning to use on an e-commerce website.
This is intended to optimize the client-side search functions and organize the internal product data, making the content administrators' tasks as simple and easy as possible (letting them, for example, choose an already-entered value instead of re-entering it, avoiding duplicated data and typos).
The content administrators will have options, following that table layout, to insert new attribution data (product characteristics) and/or new attribution values for them.
Note: the product code field product_id is outside the table relations; it is only used as a link to attach the information to the product it belongs to.
In general I'm OK with SQL joins to get data across the tables, but there are some kinds of information I need to get and manage that are driving me nuts. I've spent a lot of hours on this and all I've found is a headache.
My question is how to build, in a single query, a list of the commonly used referral data for a given CATEGORY
(when the content admin chooses, for example, "Cell Phones" in the Category field, they should get the commonly used "data table information" for that category, like Brand, Color, Screen Size, etc., so they can just pick the attributes for their category), and how to create a similar query to highlight or order by the commonly used attribution values (i.e. commonly used screen sizes).
In a single query?
SELECT ps2.product_id, rv2.attribution_value, rd.attribution_name
FROM (
    -- Find the product_ids whose Category value is 'Cell Phones'.
    SELECT ps.*
    FROM product_specs ps
    JOIN referral_values rv ON rv.id = ps.referral_value_id
    WHERE rv.attribution_value = 'Cell Phones'
) ps1
JOIN product_specs ps2 ON ps1.product_id = ps2.product_id
JOIN referral_values rv2 ON ps2.referral_value_id = rv2.id
JOIN referral_data rd ON rd.id = rv2.ref_data_id;
The inner select finds the right product_ids based on the criterion 'Cell Phones'. The outer joins populate those products with all their details: the first JOIN is a self-join of product_specs to pull in every spec row for the matched products, and the following two joins resolve the ids to their string values.
By the way: the column product_specs.referral_data_id is redundant (referral_values.ref_data_id already carries that link) and can/should be removed.

How to apply a formula for removing data noise in R?

I am working with NGSim traffic data, a text file having 18 columns and 1180598 rows. I want to smooth the position data in the column 'Local Y'. I know there are built-in functions for data smoothing in R, but none of them seems to match the formula I am required to apply. The data in the text file looks something like this:
Index VehicleID Total_Frames Local Y
1 2 5 35.381
2 2 5 39.381
3 2 5 43.381
4 2 5 47.38
5 2 5 51.381
6 4 8 504.828
7 4 8 508.325
8 4 8 512.841
9 4 8 516.338
10 4 8 520.854
11 4 8 524.592
12 4 8 528.682
13 4 8 532.901
14 5 7 39.154
15 5 7 43.153
16 5 7 47.154
17 5 7 51.154
18 5 7 55.153
19 5 7 59.154
20 5 7 63.154
The columns above are just an example taken from the original file. Here you can see 3 vehicles, with vehicle IDs 2, 4 and 5, but in fact there are 2169 vehicles with different IDs. The Total_Frames column tells us how many times each vehicle's ID is repeated; for example, in the table above vehicle ID 2 is repeated 5 times, hence the '5' in the Total_Frames column. The following is the formula I am required to apply to remove data noise from (i.e. smooth) the column 'Local Y':
$$\tilde{y}_i \;=\; \frac{\sum_{k=i-D}^{i+D} y_k \, e^{-|i-k|/\delta}}{\sum_{k=i-D}^{i+D} e^{-|i-k|/\delta}}$$

where $y_k$ is the Local Y value at index $k$, $i$ is the index of the value being smoothed, $\delta = 5$, and $D = 15$.
I have tried the built-in functions I know of, but they don't smooth the data as required. My question is: is there any built-in function in R which can smooth the data according to the given formula, or which could take this formula as an argument? I need to apply the formula to every value in Local Y, using the 15 values before and the 15 values after it (i-D and i+D) for the same vehicle ID. Can anyone give me an idea how to approach the problem? Thanks in advance.
You can place your formula in a function and then use R's apply family to apply it to the elements of the "Local Y" column of the data frame.
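A minimal sketch of that idea, assuming the data frame is called df with columns VehicleID and LocalY (both names are assumptions; rename to match the real file), and truncating the window at the ends of each trajectory where fewer than 15 neighbours exist:

smooth_one <- function(y, delta = 5, D = 15) {
  n <- length(y)
  sapply(seq_len(n), function(i) {
    k <- max(1, i - D):min(n, i + D)   # window, truncated at the edges
    w <- exp(-abs(i - k) / delta)      # exponential weights
    sum(y[k] * w) / sum(w)             # weighted average = smoothed value
  })
}

# Apply the smoother separately to each vehicle's trajectory.
df$SmoothedY <- ave(df$LocalY, df$VehicleID, FUN = smooth_one)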

How to build vtkPolyData based on the information within a txt file

I have a txt file which contains a set of 3-dimensional data points, and I would like to create a vtkPolyData based on those points.
The first line of the file gives the grid dimensions, in my case 6 x 6, and after that come the actual coordinates of each point. The content of the file looks like this:
6 6
1 1 3
2 1 3.4
3 1 3.6
4 1 3.6
5 1 3.4
6 1 3
1 2 3
2 2 3.8
3 2 4.2
4 2 4.2
5 2 3.8
6 2 3
1 3 3
2 3 3
3 3 3
4 3 3
5 3 3
6 3 3
1 4 3
2 4 3
3 4 3
4 4 3
5 4 3
6 4 3
1 5 3
2 5 3.8
3 5 4.2
4 5 4.2
5 5 3.8
6 5 3
1 6 3
2 6 3.4
3 6 3.6
4 6 3.6
5 6 3.4
6 6 3
How can I build a vtkPolyData structure with a txt file with this data?
It looks to me like you have a regularly gridded series of points, right? If so, vtkImageData might be a better choice. You can always use a geometry filter afterwards to convert to polydata if you really need it that way.
1. Create a vtkImageData instance.
2. Set its dimensions to (6, 6, 1) (the third dimension is ignored).
3. Set its data type to an appropriate type (float or double, I guess).
4. Call AllocateScalars().
5. Fill in the values:
   - In C++, call GetScalarPointer() and cast it to the data type set in step 3. This pointer will point to an array of size 36; you can just fill in each point as you would normally.
   - In another language (Tcl/Python/Java), call SetScalarComponentFromFloat on the image data with the arguments (x, y, 0, 0, value). The first 0 is the third dimension and the second is for the first component.
This will give you a grid, and it'll be far more memory efficient than a polydata.
If you want to visualize only the points, use a vtkDataSetMapper, and set up the actor's property with SetRepresentationToPoints() and an appropriate point size. That will do a simple job of the visualization.
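In Python, those steps might look like the following sketch (the file name points.txt, the parsing, and the use of the newer AllocateScalars(type, components) overload are assumptions):

import vtk

# Read the grid size from the first line, then the "x y z" triples.
with open("points.txt") as f:
    nx, ny = map(int, f.readline().split())
    rows = [line.split() for line in f if line.strip()]

image = vtk.vtkImageData()
image.SetDimensions(nx, ny, 1)            # the third dimension is ignored
image.AllocateScalars(vtk.VTK_DOUBLE, 1)  # one double component per point

for x, y, z in rows:
    # File indices are 1-based; store the height value at cell (x-1, y-1).
    image.SetScalarComponentFromFloat(int(x) - 1, int(y) - 1, 0, 0, float(z))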
Are these examples useful? In particular, this does generation of points and polygons, so it should be possible to adapt. The core seems to be (with lots left out):
# ...
vtkPolyData shell
vtkFloatPoints points
vtkCellArray strips

# Generate points...
loop {
    ...
    points InsertPoint $k $x0 $x1 $x2
}
shell SetPoints points
points Delete

# Generate triangles/polygons...
loop {
    strips InsertNextCell $NP2
    # ...
    strips InsertCellPoint [expr $kb + $ke]
    # ...
    strips InsertCellPoint [expr $kb + $ke]
}
shell SetStrips strips
strips Delete
# ...