SIMDy

Blazingly fast computing in Python

SIMDy allows you to write high-performance kernels directly in Python. Many Python features can be used inside a kernel, including expressions, functions, user-defined classes, conditionals, arrays, and loops. To reach the highest possible performance, SIMDy supports vector data types, which let a kernel use the CPU's vector units.

SIMDy's main features are:

  • automatic generation of AVX-512, FMA, AVX2, AVX, SSE instructions
  • high performance
  • on-the-fly code generation
  • native code generation
  • easy to write vectorized code

In the example below, a Monte Carlo simulation (MCS) is applied to the Black-Scholes-Merton model. If your CPU supports AVX-512, this example is 4 times faster than a C++ implementation.


import time
from multiprocessing import cpu_count
from simdy import simdy_kernel, float64x8, float64, int64

# Option Parameters
S0 = 105.00 # initial index level
K = 100.00 # strike price
T = 1. # call option maturity
r = 0.05 # constant short rate
vola = 0.25 # constant volatility factor of diffusion
NSamples = 10000000 # NSamples * 8 * cpu_count() samples are processed


# Box-Muller transform: maps two uniform random vectors to one vector of standard normal samples
@simdy_kernel
def rnd_gaus() -> float64x8:
    u = random_float64x8()
    r = sqrt(-2.0 * log(u))
    theta = 2.0 * 3.141592653589793 * random_float64x8()
    y = r * sin(theta)
    return y


@simdy_kernel(nthreads=cpu_count())
def black_scholes(S0: float64, K: float64, T: float64, r: float64, vola: float64, n: int64) -> float64:
    val = float64x8((r - 0.5 * vola * vola) * T)
    val2 = float64x8(sqrt(T) * vola)
    K8 = float64x8(K)
    C0 = float64x8(0.0)
    zero = float64x8(0)
    for i in range(n):
        ST = S0 * exp(val + val2 * rnd_gaus())
        C0 += max(ST - K8, zero)

    C0 = C0 * exp(float64x8(-r * T))
    result = C0[0] + C0[1] + C0[2] + C0[3] + C0[4] + C0[5] + C0[6] + C0[7]
    result = result / float64(n*8)
    return result

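# time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time
start = time.perf_counter()
val = black_scholes(float64(S0), float64(K), float64(T), float64(r), float64(vola), int64(NSamples))
# Average the per-thread results
val = sum(val) / cpu_count()
end = time.perf_counter()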
print("Execution time ", end - start)
print("Result ", val)
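
To sanity-check the printed result, the same estimator can be written in plain NumPy and compared with the closed-form Black-Scholes-Merton price of a European call. The sketch below does not use SIMDy and is not a benchmark. The function names `bs_call_closed_form` and `bs_call_monte_carlo` are hypothetical helpers introduced here for illustration, and it assumes NumPy is installed.

```python
import math

import numpy as np


def bs_call_closed_form(S0, K, T, r, vola):
    """Closed-form Black-Scholes-Merton price of a European call (hypothetical helper)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * vola * vola) * T) / (vola * math.sqrt(T))
    d2 = d1 - vola * math.sqrt(T)

    def norm_cdf(x):
        # Standard normal CDF expressed through the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


def bs_call_monte_carlo(S0, K, T, r, vola, n, seed=0):
    """Monte Carlo estimate of the same price (hypothetical helper).

    Mirrors the kernel above: simulate the terminal price, average the
    call payoff max(ST - K, 0), and discount it by exp(-r * T).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)  # standard normal samples
    ST = S0 * np.exp((r - 0.5 * vola * vola) * T + vola * math.sqrt(T) * z)
    payoff = np.maximum(ST - K, 0.0)
    return math.exp(-r * T) * payoff.mean()


if __name__ == "__main__":
    # Same option parameters as in the SIMDy example
    params = dict(S0=105.0, K=100.0, T=1.0, r=0.05, vola=0.25)
    print("Closed form ", bs_call_closed_form(**params))
    print("Monte Carlo ", bs_call_monte_carlo(n=1_000_000, **params))
```

With enough samples, the Monte Carlo estimate and the value printed by the SIMDy example should both land close to the closed-form price. A large gap would suggest a problem in the random number generation or in the final averaging step.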