% $Id: prof.tex,v 1.4 1994/03/05 00:25:02 otto Exp $
\chapter{Profiling Interface}
\label{sec:prof}
\label{chap:prof}
%\footnotetext[1]{Version as of March, 1994}

\section{Requirements}

To meet the requirements of the \MPI/ profiling interface, an
implementation of the \MPI/ functions {\em must}
\begin{enumerate}
\item provide a mechanism through which all of the \MPI/ defined
  functions may be accessed with a name shift. Thus all of the \MPI/
  functions (which normally start with the prefix ``{\tt MPI\_}'')
  should also be accessible with the prefix ``{\tt PMPI\_}''.
\item ensure that those \MPI/ functions which are not replaced may
  still be linked into an executable image without causing name
  clashes.
\item document the implementation of different language bindings of
  the \MPI/ interface if they are layered on top of each other, so
  that the profiler developer knows whether she must implement the
  profile interface for each binding, or can economise by
  implementing it only for the lowest level routines.
\item where the implementation of different language bindings is done
  through a layered approach (e.g. the Fortran binding is a set of
  ``wrapper'' functions which call the C implementation), ensure that
  these wrapper functions are separable from the rest of the library.

  This is necessary to allow a separate profiling library to be
  correctly implemented, since (at least with Unix linker semantics)
  the profiling library must contain these wrapper functions if it is
  to perform as expected. This requirement allows the person who
  builds the profiling library to extract these functions from the
  original \MPI/ library and add them into the profiling library
  without bringing along any other unnecessary code.
\item provide a no-op routine \func{MPI\_PCONTROL} in the \MPI/
  library.
\end{enumerate}

\section{Discussion}

The objective of the \MPI/ profiling interface is to ensure that it
is relatively easy for authors of profiling (and other similar) tools
to interface their codes to \MPI/ implementations on different
machines.

Since \MPI/ is a machine independent standard with many different
implementations, it is unreasonable to expect that the authors of
profiling tools for \MPI/ will have access to the source code which
implements \MPI/ on any particular machine. It is therefore necessary
to provide a mechanism by which the implementors of such tools can
collect whatever performance information they wish {\em without}
access to the underlying implementation.

We believe that having such an interface is important if \MPI/ is to
be attractive to end users, since the availability of many different
tools will be a significant factor in attracting users to the \MPI/
standard.

The profiling interface is just that, an interface. It says {\em
nothing} about the way in which it is used. There is therefore no
attempt to lay down what information is collected through the
interface, or how the collected information is saved, filtered, or
displayed.
While the initial impetus for the development of this interface arose
from the desire to permit the implementation of profiling tools, it
is clear that an interface like that specified may also prove useful
for other purposes, such as ``internetworking'' multiple \MPI/
implementations. Since all that is defined is an interface, there is
no objection to its being used wherever it is useful.

As the issues being addressed here are intimately tied up with the
way in which executable images are built, which may differ greatly on
different machines, the examples given below should be treated solely
as one way of implementing the objective of the \MPI/ profiling
interface. The actual requirements made of an implementation are
those detailed in the Requirements section above; the whole of the
rest of this chapter is only present as justification and discussion
of the logic for those requirements.

The examples below show one way in which an implementation could be
constructed to meet the requirements on a Unix system (there are
doubtless others which would be equally valid).

\section{Logic of the design}

Provided that an \MPI/ implementation meets the requirements above,
it is possible for the implementor of the profiling system to
intercept all of the \MPI/ calls which are made by the user program.
She can then collect whatever information she requires before calling
the underlying \MPI/ implementation (through its name shifted entry
points) to achieve the desired effects.

\subsection{Miscellaneous control of profiling}

There is a clear requirement for the user code to be able to control
the profiler dynamically at run time. This is normally used for (at
least) the purposes of
\begin{itemize}
\item Enabling and disabling profiling depending on the state of the
  calculation.
\item Flushing trace buffers at non-critical points in the
  calculation.
\item Adding user events to a trace file.
\end{itemize}
These requirements are met by use of \func{MPI\_PCONTROL}.
\begin{funcdef}{MPI\_PCONTROL(level, \ldots)}
\funcarg{\IN}{level}{Profiling level}
\end{funcdef}

\mpibind{MPI\_Pcontrol(const~int~level, \ldots)}

\mpifbind{MPI\_PCONTROL(level)\fargs INTEGER LEVEL, \ldots}

\MPI/ libraries themselves make no use of this routine, and simply
return immediately to the user code. However the presence of calls to
this routine allows a profiling package to be explicitly called by
the user.

Since \MPI/ has no control of the implementation of the profiling
code, we are unable to specify precisely the semantics which will be
provided by calls to \func{MPI\_PCONTROL}. This vagueness extends to
the number of arguments to the function, and their datatypes.

However, to provide some level of portability of user codes to
different profiling libraries, we request the following meanings for
certain values of {\tt level}.
\begin{itemize}
\item{{\tt level==0}} Profiling is disabled.
\item{{\tt level==1}} Profiling is enabled at a normal default level
  of detail.
\item{{\tt level==2}} Profile buffers are flushed. (This may be a
  no-op in some profilers).
\item{All other values of {\tt level}} have profile library defined
  effects and additional arguments.
\end{itemize}

We also request that the default state after \func{MPI\_INIT} has
been called is for profiling to be enabled at the normal default
level (i.e. as if \func{MPI\_PCONTROL} had just been called with the
argument 1). This allows users to link with a profiling library and
obtain profile output without having to modify their source code at
all.

The provision of \func{MPI\_PCONTROL} as a no-op in the standard
\MPI/ library allows them to modify their source code to obtain more
detailed profiling information, but still be able to link exactly the
same code against the standard \MPI/ library.
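As an illustration only (this is not part of the standard), a
profiling library might implement \func{MPI\_PCONTROL} so as to
honour the level meanings requested above. The {\tt profileLevel}
variable and the treatment of extra arguments here are hypothetical;
only the meanings of levels 0, 1 and 2 come from the text above.

\begin{verbatim}
/* Hypothetical profiling-library state; enabled at the default
   level, as requested for the state after MPI_INIT. */
static int profileLevel = 1;

int MPI_Pcontrol(const int level, ...)
{
    switch (level) {
    case 0:            /* profiling disabled */
    case 1:            /* profiling enabled at normal default detail */
        profileLevel = level;
        break;
    case 2:            /* flush profile buffers (may be a no-op) */
        break;
    default:           /* profile library defined effects; extra
                          arguments would be read with <stdarg.h> */
        break;
    }
    return 0;          /* i.e. MPI_SUCCESS */
}
\end{verbatim}

A user code linked against the standard \MPI/ library instead reaches
the no-op version, so the same source links unchanged in both cases.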
\section{Examples}

\subsection{Profiler implementation}

Suppose that the profiler wishes to accumulate the total amount of
data sent by the \func{MPI\_SEND} function, along with the total
elapsed time spent in the function. This could trivially be achieved
thus
\snir
\begin{verbatim}
static int totalBytes;
static double totalTime;

int MPI_Send(void * buffer, const int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double tstart = MPI_Wtime();        /* Pass on all the arguments */
    int extent;
    int result = PMPI_Send(buffer,count,datatype,dest,tag,comm);

    MPI_Type_size(datatype, &extent);   /* Compute size */
    totalBytes += count*extent;

    totalTime += MPI_Wtime() - tstart;  /* and time */

    return result;
}
\end{verbatim}
\rins

\subsection{MPI library implementation}

On a Unix system, in which the \MPI/ library is implemented in C,
there are various possible options, of which two of the most obvious
are presented here. Which is better depends on whether the linker and
compiler support weak symbols.

\subsubsection{Systems with weak symbols}

If the compiler and linker support weak external symbols
(e.g. Solaris 2.x, other System V.4 machines), then only a single
library is required through the use of {\tt \#pragma weak} thus
\begin{verbatim}
#pragma weak MPI_Example = PMPI_Example

int PMPI_Example(/* appropriate args */)
{
    /* Useful content */
}
\end{verbatim}
The effect of this {\tt \#pragma} is to define the external symbol
{\tt MPI\_Example} as a weak definition. This means that the linker
will not complain if there is another definition of the symbol (for
instance in the profiling library); however if no other definition
exists, then the linker will use the weak definition.
\subsubsection{Systems without weak symbols}

In the absence of weak symbols, one possible solution would be to use
the C macro pre-processor thus
\begin{verbatim}
#ifdef PROFILELIB
#  ifdef __STDC__
#    define FUNCTION(name) P##name
#  else
#    define FUNCTION(name) P/**/name
#  endif
#else
#  define FUNCTION(name) name
#endif
\end{verbatim}
Each of the user visible functions in the library would then be
declared thus
\begin{verbatim}
int FUNCTION(MPI_Example)(/* appropriate args */)
{
    /* Useful content */
}
\end{verbatim}
The same source file can then be compiled to produce both versions of
the library, depending on the state of the {\tt PROFILELIB} macro
symbol.

It is required that the standard \MPI/ library be built in such a way
that the inclusion of \MPI/ functions can be achieved one at a time.
This is a somewhat unpleasant requirement, since it may mean that
each external function has to be compiled from a separate file.
However this is necessary so that the author of the profiling library
need only define those \MPI/ functions which she wishes to intercept,
references to any others being fulfilled by the normal \MPI/ library.
Therefore the link step can look something like this
\begin{verbatim}
% cc ... -lmyprof -lpmpi -lmpi
\end{verbatim}
Here {\tt libmyprof.a} contains the profiler functions which
intercept some of the \MPI/ functions, {\tt libpmpi.a} contains the
``name shifted'' \MPI/ functions, and {\tt libmpi.a} contains the
normal definitions of the \MPI/ functions.

\subsection{Complications}

\subsubsection{Multiple counting}

Since parts of the \MPI/ library may themselves be implemented using
more basic \MPI/ functions (e.g. a portable implementation of the
collective operations implemented using point to point
communications), there is potential for profiling functions to be
called from within an \MPI/ function which was called from a
profiling function. This could lead to ``double counting'' of the
time spent in the inner routine.
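One way for a profiling library to avoid such double counting in a
single threaded program is a static flag which records that a wrapper
is already active, so that nested calls are passed straight through
to the name shifted routine. The following self-contained sketch is
illustrative only; the stub type definitions and the toy
{\tt PMPI\_Send} stand in for {\tt <mpi.h>}, and the
{\tt profiledCalls} counter is hypothetical.

\begin{verbatim}
/* Stub declarations standing in for <mpi.h>, for illustration. */
typedef int MPI_Datatype;
typedef int MPI_Comm;
static int PMPI_Send(void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm)
{
    (void)buf; (void)count; (void)datatype;
    (void)dest; (void)tag; (void)comm;
    return 0;
}

static int inWrapper = 0;      /* nonzero while a wrapper is active */
static int profiledCalls = 0;  /* calls actually counted */

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int result;
    if (inWrapper)   /* nested call from inside the MPI library: */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);

    inWrapper = 1;   /* outermost call: collect statistics once */
    profiledCalls++;
    result = PMPI_Send(buf, count, datatype, dest, tag, comm);
    inWrapper = 0;
    return result;
}
\end{verbatim}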
Since this effect could actually be useful under some circumstances
(e.g. it might allow one to answer the question ``How much time is
spent in the point to point routines when they're called from
collective functions?''), we have decided not to enforce any
restrictions on the author of the \MPI/ library which would overcome
this. Therefore the author of the profiling library should be aware
of this problem, and guard against it herself. In a single threaded
world this is easily achieved through use of a static variable in the
profiling code which remembers if you are already inside a profiling
routine. It becomes more complex in a multi-threaded environment (as
does the meaning of the times recorded!).

\subsubsection{Linker oddities}

The Unix linker traditionally operates in one pass: the effect of
this is that functions from libraries are only included in the image
if they are needed at the time the library is scanned. When combined
with weak symbols, or multiple definitions of the same function, this
can cause odd (and unexpected) effects.

Consider, for instance, an implementation of \MPI/ in which the
Fortran binding is achieved by using wrapper functions on top of the
C implementation. The author of the profile library then assumes that
it is reasonable only to provide profile functions for the C binding,
since Fortran will eventually call these, and the cost of the
wrappers is assumed to be small. However, if the wrapper functions
are not in the profiling library, then none of the profiled entry
points will be undefined when the profiling library is scanned.
Therefore none of the profiling code will be included in the image.
When the standard \MPI/ library is scanned, the Fortran wrappers will
be resolved, and will also pull in the base versions of the \MPI/
functions. The overall effect is that the code will link
successfully, but will not be profiled.
To overcome this we must ensure that the Fortran wrapper functions
are included in the profiling version of the library. We ensure that
this is possible by requiring that these be separable from the rest
of the base \MPI/ library. This allows them to be {\tt ar}ed out of
the base library and into the profiling one.

\section{Multiple levels of interception}

The scheme given here does not directly support the nesting of
profiling functions, since it provides only a single alternative name
for each \MPI/ function. Consideration was given to an implementation
which would allow multiple levels of call interception; however we
were unable to construct an implementation of this which did not have
the following disadvantages
\begin{itemize}
\item assuming a particular implementation language.
\item imposing a run time cost even when no profiling was taking
  place.
\end{itemize}
Since one of the objectives of \MPI/ is to permit efficient, low
latency implementations, and it is not the business of a standard to
require a particular implementation language, we decided to accept
the scheme outlined above.

Note, however, that it is possible to use the scheme above to
implement a multi-level system, since the function called by the user
may call many different profiling functions before calling the
underlying \MPI/ function. Unfortunately such an implementation may
require more cooperation between the different profiling libraries
than is required for the single level implementation detailed above.
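As a toy illustration of such a multi-level scheme (all names here
are hypothetical, and the routine is reduced to a trivial function),
the user-visible entry point can simply chain through the cooperating
profiling layers before reaching the name shifted routine:

\begin{verbatim}
static int layer1Calls = 0;    /* interceptions seen by each layer */
static int layer2Calls = 0;

/* The "real" name shifted implementation (a toy stand-in). */
static int PMPI_Example(int x) { return x + 1; }

/* Two cooperating profiling layers, innermost last. */
static int Prof2_Example(int x) { layer2Calls++; return PMPI_Example(x); }
static int Prof1_Example(int x) { layer1Calls++; return Prof2_Example(x); }

/* The user-visible function calls each profiling layer in turn. */
int MPI_Example(int x) { return Prof1_Example(x); }
\end{verbatim}

The cooperation required is visible even in this sketch: each layer
must know the name of the next layer down, which is exactly the extra
coordination between profiling libraries noted above.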