Structs, C#7 and Performance Improvement 

C# 7 provides a very powerful feature from the performance standpoint. This feature is returning by ref. Essentially this allows for value types to be returned without having to copy them.

The guidelines are normally that you shouldn’t use a struct with too many fields. Various sources quote various size guidelines. Whenever the struct was over the prescribed size it was recommended that you pass it by reference.

With the new syntax for returning types by reference, it’s now more convenient (no more methods returning via out parameter) to use structs. 

In performance critical scenarios where you need to avoid polluting managed heap with too many Gen #0 objects using structs has now become more natural. In the past dealing with structs was somewhat cumbersome if you dealt with a large number of fields and needed to avoid copying of values. 

I have worked on a large application at McLaren – Telemetry Acquisition System that is supplied to all teams. The performance of the application is very critical as it has to process gigabytes of telemetry data. We have used structs extensively to squeeze out every bit of performance from .NET runtime.

I think it’s my second favourite feature after value tuples.

TPL Dataflow

Overview

The Task Parallel Library (TPL) was introduced in the .NET Framework 4, providing core building blocks and algorithms for parallel computation and asynchrony.  This work was centered around the System.Threading.Tasks.Task type, as well as on a few higher-level constructs.  These higher-level constructs address a specific subset of common parallel patterns, e.g. Parallel.For/ForEach for delightfully parallel problems expressible as parallelized loops.

While a significant step forward in enabling developers to parallelize their applications, this work did not provide higher-level constructs necessary to tackle all parallel problems or to easily implement all parallel patterns.  In particular, it did not focus on problems best expressed with agent-based models or those based on message-passing paradigms.  These kinds of problems are quite prevalent in technical computing domains such as finance, biological sciences, oil & gas, and manufacturing.

For TPL Dataflow (TDF), we build upon the foundational layer provided in TPL in .NET 4. TDF is a complementary set of primitives to those primitives delivered in TPL in .NET 4, addressing additional scenarios beyond those directly and easily supported with the original APIs.  TPL Dataflow utilizes tasks, concurrent collections, tuples, and other features introduced in .NET 4 to bring support for parallel dataflow-based programming into the .NET Framework.  It also directly integrates with new language support for tasks and asynchrony provided by both C# and Visual Basic, and with existing language support in .NET 4 for tasks provided by F#.

Continue reading “TPL Dataflow”

AForge.Net

Definitely something to play with when my brain has more oxygen (suffering from anaemia right now)

Deneme 1 2 3!

AForge.NET  is a C# framework designed for developers and researchers in the fields of Computer Vision and Artificial Intelligence – image processing, neural networks, genetic algorithms, machine learning, robotics, etc.

  • AForge.Imaging – library with image processing routines and filters;
  • AForge.Vision – computer vision library;
  • AForge.Neuro – neural networks computation library;
  • AForge.Genetic – evolution programming library;
  • AForge.Fuzzy – fuzzy computations library;
  • AForge.MachineLearning – machine learning library;
  • AForge.Robotics – library providing support of some robotics kits;
  • AForge.Video – set of libraries for video processing
  • etc.

are mainly topic in this framework. İf you want more information and help  visit: https://code.google.com/p/aforge/

View original post

Microservices and Docker containers: Architecture, Patterns and Development guidance

As part of the series of posts announced at this initial blog post (.NET Application Architecture Guidance) that explores each of the architecture areas currently covered by our team, this current blog post focuses on “Microservices and Docker containers: Architecture, Patterns and Development guidance”.

Just as a reminder, the four introductory blog posts of this series will be the following:

The microservices architecture is emerging as an important approach for distributed mission-critical applications. In a microservice-based architecture, the application is built on a collection of services that can be developed, tested, deployed, and versioned independently. In addition, enterprises are increasingly realizing cost savings, solving deployment problems, and improving DevOps and production operations by using containers (Docker engine based as de facto standard).

Microsoft has been releasing container innovations for Windows and Linux by creating products like Azure Container Service and Azure Service Fabric, and by partnering with industry leaders like Docker, Mesosphere, and Kubernetes. These products deliver container solutions that help companies build and deploy applications at cloud speed and scale, whatever their choice of platform or tools…

https://blogs.msdn.microsoft.com/dotnet/2017/08/02/microservices-and-docker-containers-architecture-patterns-and-development-guidance/

Cache Consideration in Multi-Threaded Code

In parallel programs is very important to regard cache size and hit rates on a single CPU, but it’s even more important to consider how the caches of multiple processors/cores interact. Let’s consider a single representative example, which demonstrates the important cache optimisation and emphasizes the value of good tools when it comes to performance optimisation in general.

Let’s first examine the first sequential method, it performs the rudimentary task of summing all the elements in a two-dimensional array of integers and returns the result:

public static int MatrixSumSequential(int [,] matrix)
{
    int sum = 0;
    int rows = matrix.GetUpperBound(0);
    int cols = matrix.GetUpperBound(1);
    for(int i = 0; i < rows; i++)
    {
        for(int j = 0; j < cols; j++)
        {
            sum += matrix[i, j];
        }
    }
    return sum;  
}

We could have used TPL but let’s ignore the huge arsenal of tools TPL provides in our simple example. The following attempt at parallelisation may appear sufficiently reasonable to harvest the fruits of multi-core execution, and even implements a crude aggregation to avoid synchronisation on the shared sum variable:

public static int MatrixSumParallel(int [,] matrix)
{
    int sum = 0;
    int rows = matrix.GetUpperBound(0);
    int cols = matrix.GetUpperBound(1);
    const int THREADS = 4;
    int chunk = row / THREADS;
    int [] localSums = new int[THREADS];
    Threads [] threads = new Threads[THREADS];
    for(int i = = 0; i < THREADS; i++)
    {
        int start = chunk * i;
        int end - chunk * (1 + i);
        int threadNum = i;
        threads[i] = new Thread(() => {
            for(int row = start; row < end; r++)
            {
                for(int col = 0; col < cols; col++)
                {
                    localSums[threadNum] += matrix[row, col];
                }
            }
        });
        threads[i].Start();
        foreach(var thread in threads)
            thread.Join();
    }
    return localSums.Sum();
}

 

Executing each of the two methods several times on an i7 machine with 6 cores produced the following results for a 2,000 x 2,000 matrix of integers:

  • 325ms average for sequential method
  • 935ms for the parallel method. Three times as slow as the sequential method!

The obvious question is why?
This is not an example of too fine grained parallelism because the number of threads is only 4. However if you accept the premise that the problem is somehow the cache related, it would make sense to measure the number of cache misses introduced by the 2 methods above.

The Visual Studio profiler when sampling the execution of each method with a 2,000 x 2,000 matrix reported 963 exclusive samples in the parallel version and only 659 exclusive samples in the sequential version, the vast majority of samples being on the inner loop line that reads from the matrix.

Why would a line of code writing to localSums introduce so many cache misses in comparison to writing to sum local variable? The answer is that the writes to the shared array invalidate cache lines at other processors/cores, causing every += operating to be a cache miss.
When the processor writes to a memory location that is in the cache of another processor/core cache, the hardware causes a cache invalidation, that marks the cache line as invalid. Accessing that line results in a cache miss.

The moral of the story do not blindly introduce parallelization in a hope that that would also result in the performance increase. Always test both versions, you might be surprised at the results!

Fraction Implementation in C#

I’m not really sure why Microsoft have never bothere with implementing a Fraction primitive in .NET. I’m sure there are plenty of uses as fraction allow to preserve the maximum possible precision. I have therefore decided to create my own implementation (albeit somewhat primitive at this stage) .

My implementation automatically simplifies the fraction, so if you we to create new Function(6, 3) that would be simplified to 2. The Fraction struct implements all the arithmetic operators on itself and on Int64, float, double and decimal.

Internally the Fraction is represented as two Int64: Numerator and Denominator and is always simplified upon initialisation. I initially intended to have it as an option, however following profiling the cost of simplification is not that great and the benefits outweigh the performance drawbacks. 

Fraction has explicit conversion to Int64 (although that is bound to lose precision), float, double and decimal. It supports comparison with Int64, float, double and decimal and even supports ++ and — operations.

So far I have provided more or less complete implementation with plenty of Unit Test. Now the hard word of optimising the performance begins!

Design of Fractions

Fraction is implemented as a struct (pretty obvious choice). It takes a numerator as the first argument and denominator,  it then tries to simplify the fraction using the Euclidean algorithm, so if you were to specify 333/111 it would become 3.

The implementation supports all arithmetic operations with long, float, double and decimal and can also be converted to those type by either calling the corresponding methods or using explicit cast.

You can also create a function from either a long, float, double or decimal. Conversion from a long is quite trivial however conversion from a float, double or a decimal goes through a while loop and multiplies the floating point number until it has no decimal places. This method is relatively slow and therefore is not recommended.

Apart from that the Fraction behaves like a fist class citizen: you can compare a Fraction to any other number, divide, multiply, add, subtract, compare, increment decrement etc.

For example:

var oneThird = Fraction(1, 3);
var reciprocal = oneThird.Reciprocal();

Console.WriteLine(oneThird * reciprocal) : "1"
Console.WriteLine(++oneThird) : "4/3" - just like with an integer ++ adds 1 
Console.WriteLine(oneThird * oneThird) : "1/9"

 

Please feel free to contribute to the codebase if you feel like it

https://github.com/ebalynn/Balynn.Maths.Fraction

ConditionalWeakTable – Weak Dictionary

If you are about to begin implementing your own version of a thread safe generic weak dictionary – STOP!
As of .NET 4.0 there is already a class that implements that functionality and it’s called ConditionalWeakTable and it exists in System.Runtime.CompilerServices namespace.

There are however several limitations: both the key and the value have to be reference types (TKey : class and TValue : class).

Here is the comment from the source file:

** Description: Compiler support for runtime-generated "object fields."
**
** Lets DLR and other language compilers expose the ability to
** attach arbitrary "properties" to instanced managed objects at runtime.
**
** We expose this support as a dictionary whose keys are the
** instanced objects and the values are the "properties."
**
** Unlike a regular dictionary, ConditionalWeakTables will not
** keep keys alive.
**
**
** Lifetimes of keys and values:
**
** Inserting a key and value into the dictonary will not
** prevent the key from dying, even if the key is strongly reachable
** from the value.
**
** Prior to ConditionalWeakTable, the CLR did not expose
** the functionality needed to implement this guarantee.
**
** Once the key dies, the dictionary automatically removes
** the key/value entry.
**
**
** Relationship between ConditionalWeakTable and Dictionary:
**
** ConditionalWeakTable mirrors the form and functionality
** of the IDictionary interface for the sake of api consistency.
**
** Unlike Dictionary, ConditionalWeakTable is fully thread-safe
** and requires no additional locking to be done by callers.
**
** ConditionalWeakTable defines equality as Object.ReferenceEquals().
** ConditionalWeakTable does not invoke GetHashCode() overrides.
**
** It is not intended to be a general purpose collection
** and it does not formally implement IDictionary or
** expose the full public surface area.
**
**
**
** Thread safety guarantees:
**
** ConditionalWeakTable is fully thread-safe and requires no
** additional locking to be done by callers.
**
**
** OOM guarantees:
**
** Will not corrupt unmanaged handle table on OOM. No guarantees
** about managed weak table consistency. Native handles reclamation
** may be delayed until appdomain shutdown.

Just by looking at the comments alone we can extrapolate that what we have is equivalent to a generic, thread safe weak dictionary! Further internet research, confirms the findings.
There are several critical limitations though:
• TKey and TValue both have to be reference types
• The equality is defined using ReferenceEquals()
• GetHashCode() overrides are never called

MethodImplOptions.AggressiveInlining

The JIT compiler logically determines which methods to inline. But sometimes we know better than it does. With AggressiveInlining, we give the compiler a hint. We tell it that the method should be inlined. Actually, the only hint we give the compiler is to ignore the size restriction on the method or the property you want to inline. Using this attribute does not guarantee that the method will be inlined. There are 1000 and 1 reasons why it cannot be (being virtual for one thing)

Example

This example benchmarks a method with no attribute, and with AggressiveInlining. The method body contains several lines of useless code. This makes the method large in bytes, so the JIT compiler may decide not to inline it.

And: We apply the MethodImplOptions.AggressiveInlining option to Method2. It is an enum.

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
 
class Program
{
    const int _max = 10000000;
    static void Main()
    {
        // ... Compile the methods
        Method1();
        Method2();
        int sum = 0;
 
        var s1 = Stopwatch.StartNew();
        for (int i = 0; i &lt; _max; i++)
        {
            sum += Method1();
        }
        s1.Stop();
        var s2 = Stopwatch.StartNew();
        for (int i = 0; i &lt; _max; i++)
        {
          sum += Method2();
        }
        s2.Stop();
        
        Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
    
        Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000000) / _max).ToString("0.00 ns"));
        Console.Read();
    }
 
    static int Method1()
    {
        // ... No inlining suggestion
        return "one".Length + "two".Length + "three".Length +
            "four".Length + "five".Length +   "six".Length +
            "seven".Length + "eight".Length + "nine".Length +
            "ten".Length;
    }
 
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Method2()
    {
        // ... Aggressive inlining
        return "one".Length + "two".Length + "three".Length +
            "four".Length + "five".Length + "six".Length +
            "seven".Length + "eight".Length + "nine".Length +
            "ten".Length;
    }
}

Output

7.34 ns No options
0.32 ns MethodImplOptions.AggressiveInlining

We see that with no options, the method calls required seven nanoseconds each. But with inlining specified (with AggressiveInlining), the calls required less than one nanosecond each.

Tip 1: Consider for a moment all the things you could do with those seven nanoseconds.

Tip 2: If you are scheduling your life based on nanoseconds, please consider reducing your coffee intake.

Performance Drawbacks of Destructors (Finalizers) in .NET

Every decent .NET developer understands why we need destructors and how they work. In short once you create a method ‘~[ClassName]’ that particular object is placed into finalization queue. Normally those classes that use destructors do also implement the IDisposable interface, and the destructor is there only as a safety net – i.e. in case the developer forgets to call implicitly or explicitly ‘Dispose()’.

However, even though it is clearly stated, not every developer realizes the implications of having to put that object into finalization queue. The popular trail of thought is that if ‘GC.SuppressFinalize()’ there would be no performance penalty, at the end of the day we are removing the object from the finalisation queue and the GC has less work to do.
What many do not realise is that there is only one finalisation queue and if working with multiple threads .NET would have to synchronise the access to that queue somehow. So even when creating objects with destructors from multiple threads, inadvertently we create contention to the finalisation queue and even the creation of such objects could potentially be slower.

Blog at WordPress.com.

Up ↑