Last Modified: Jan 04, 2013
How to Contribute
There are some simple ways to contribute to dlib:
- Find confusing or incorrect documentation
- Help make the web page prettier
- Link to dlib from your web page
- Add yourself or your project to the list of
dlib users
- Try to run the dlib regression test suite on any platforms you
have access to
Contributing Code
Code contributions are also welcome. I use Mercurial for version control, so the simplest
way to contribute code is to clone the dlib repository, make your changes, and then email
them to me as a bundle file. An example is shown below.
hg clone http://hg.code.sf.net/p/dclib/code dclib-code
cd dclib-code
[make changes to dlib source code]
hg commit -u "Your Name <your.email@host.com>"
hg bundle dlibchanges.hg
Then email me the dlibchanges.hg file and I'll review it and get back to you.
If you want to make a big change or feature addition, it's probably a good idea to talk to me about it first.
Additionally, you should read over the coding guidelines below
and try to follow them. It is also probably a good idea to read the books Effective C++ and
More Effective C++ by Scott Meyers. Finally, if you are not familiar with Mercurial you
might also want to read the excellent introduction by Joel Spolsky.
Coding Guidelines
1. Use Design by Contract
2. Use spaces instead of tabs.
3. Use the standard C++ naming convention
4. Use RAII
5. Don't use pointers
6. Don't use #define for constants.
7. Don't use stack based arrays.
8. Use exceptions, but don't abuse them
9. Write portable code
10. Set up regression tests
11. Use the Boost Software License
Apply Design by Contract to Your Code
The most important part of a software library isn't the code, it is the set
of interfaces the library exposes to the user. These interfaces need to be easy
to use right, and hard to use wrong. The only way this
happens is if the interfaces are documented in a simple, consistent, and precise way.
The approach I use to design and document these interfaces is known as
Design by Contract. There is a lot that can be said about Design by Contract. In fact,
whole books have been written about it, and programming languages exist which
use Design by Contract as a central element. Here I will just go over some
of the basic ways it is used in dlib as well as some of the reasons why it is a
Good Thing.
- Functions should have documented preconditions which are programmatically verifiable
Many functions have a set of requirements or preconditions that need to be satisfied
if they are to be used. If these requirements are not satisfied
when a function is called then the function will not do what it is supposed to do. Moreover,
any piece of software that calls a function but doesn't make sure all preconditions
are satisfied contains a bug, by definition.
This means all functions must precisely document their preconditions if they are to be
usable. In fact, all preconditions should be programmatically verifiable. Doing this
has a number of benefits. First, it means they are unambiguous. English
can be confusing and vague, but saying "some_predicate == true" uses a
formal language, C++, that we all should understand quite well. Second, it means
you can put checks into the code that will catch all usage errors.
These checks should always be implemented using
DLIB_ASSERT or
DLIB_CASSERT and they should always
cover all preconditions.
These macros take a boolean argument and if it is false they throw dlib::fatal_error. So
you can use them to check that all your preconditions are true. Also, don't forget that
a violated function precondition indicates a bug in a program.
That is, when dlib::fatal_error is thrown it means a bug has been found and the only thing
an application can do at that point is print an error message and terminate.
In fact, dlib::fatal_error has checks in it to make sure someone doesn't catch the
exception and ignore it. These checks will abruptly terminate any program that attempts
to ignore fatal errors.
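As a rough illustration of a documented, programmatically verifiable precondition, here is a minimal sketch using only the C++ standard library. EXAMPLE_ASSERT and checked_sqrt are hypothetical names invented for this sketch. The macro is a stand-in for a check like DLIB_ASSERT, not dlib's implementation, and it throws std::logic_error rather than dlib::fatal_error.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a precondition-checking macro such as DLIB_ASSERT.
// It only mimics the "throw when the expression is false" idea.
#define EXAMPLE_ASSERT(expression, message)                                  \
    do {                                                                     \
        if (!(expression)) {                                                 \
            std::ostringstream sout;                                         \
            sout << "Precondition violated: " #expression "\n" << message;   \
            throw std::logic_error(sout.str());                              \
        }                                                                    \
    } while (false)

// requires: x >= 0
// ensures:  returns the square root of x
double checked_sqrt(double x)
{
    // The check repeats the documented precondition word for word.
    EXAMPLE_ASSERT(x >= 0, "checked_sqrt() called with x = " << x);
    return std::sqrt(x);
}
```

Because the documented requirement and the checked expression are the same C++ expression, a caller who violates it gets a message naming exactly which requirement was broken.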
The above considerations bring me to my next bit of advice. Developers new to Design by Contract
often confuse input validation requirements with function preconditions. When I tell them
to consider any violation of a function's preconditions a bug and terminate their application
with an error message they complain that this is not at all what an application should do when
it receives invalid user inputs.
They are right: that would be a bad thing
and you should not write software that behaves that way. The way out of this problem is, of
course, to not consider invalid input as a bug. Instead, you should perform explicit input validation
on any
data coming into your program before it gets to any functions that have preconditions
which demand the validated inputs. Moreover, if you make your preconditions programmatically verifiable
then it should be easy to validate any inputs by simply using whatever it is you
use to check your preconditions.
Consider the function cross_validate_trainer as an
example. One of its requirements is that the input forms a valid binary classification problem.
This is documented in the list of preconditions as
"is_binary_classification_problem(x,y) == true". This precondition is just saying
that when you call
the is_binary_classification_problem
function on the x and y inputs it had better return true
if you want to use those inputs with the cross_validate_trainer function.
Given this information it is trivial to perform input validation. All you have to do is
call is_binary_classification_problem on your input data and you are done.
Using the above technique you have validated your inputs, documented your preconditions, and are
buffered by DLIB_ASSERT statements that will catch you if you accidentally forget to validate any
inputs.
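The technique described above can be sketched with a made-up predicate standing in for is_binary_classification_problem. All names below (is_valid_labeling, count_positive, try_count_positive) are hypothetical and use only the standard library. The point is that the same predicate serves both as the precondition check and as the input validator.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical predicate: there is one label per sample and each label is +1 or -1.
bool is_valid_labeling(const std::vector<double>& samples,
                       const std::vector<int>& labels)
{
    if (samples.empty() || samples.size() != labels.size())
        return false;
    for (std::size_t i = 0; i < labels.size(); ++i)
        if (labels[i] != 1 && labels[i] != -1)
            return false;
    return true;
}

// requires: is_valid_labeling(samples, labels) == true
// ensures:  returns the number of samples labeled +1
std::size_t count_positive(const std::vector<double>& samples,
                           const std::vector<int>& labels)
{
    // A violated precondition is a bug in the caller, so fail loudly.
    if (!is_valid_labeling(samples, labels))
        throw std::logic_error("precondition violated in count_positive()");

    std::size_t count = 0;
    for (std::size_t i = 0; i < labels.size(); ++i)
        if (labels[i] == 1)
            ++count;
    return count;
}

// Input validation at the program boundary. Bad user data is not a bug, so it
// is reported through the return value and never reaches the precondition check.
bool try_count_positive(const std::vector<double>& samples,
                        const std::vector<int>& labels,
                        std::size_t& result)
{
    if (!is_valid_labeling(samples, labels))
        return false;
    result = count_positive(samples, labels);
    return true;
}
```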
The thing to understand here is that
a violation of a function's preconditions means you have a bug on your hands. In other words,
you should never intentionally violate any function preconditions. Of course,
it will still happen from time to time because bugs are unavoidable, but at least with
this approach you will get a detailed error message early in development rather than a
mysterious segmentation fault days or weeks later.
- Functions should have documented postconditions
I don't have nearly as much to say about postconditions as I did about function requirements. You should
strive to write programmatically verifiable postconditions because that makes your postconditions
more precise. However, it is sometimes the case that this isn't practical and that is fine.
But whatever you do write needs to clearly communicate to the
user what it is your function does.
Now you may be wondering why this is called Design by Contract and not Documentation
by Contract. The reason is that the process of writing down all these detailed descriptions
of what your code does becomes part of how you design software. For example, often you
will find that when you go to write down the requirements for calling a function you are unable
to do so. This may be because the requirements are so complex you can't think of a way
to describe them, or you may realize that you don't even know what they are. Alternatively,
you may know what they are but discover there isn't any way to verify them programmatically. All these
things are symptoms of a bad design and the reason you became aware of this design problem
was by attempting to apply Design by Contract.
After you get enough practice with this way of writing software you begin to think a lot
more about questions like "how can I design this class such that every member function
has a very simple set of requirements and postconditions?" Once you start doing this
you are well on your way to creating software components that are easy to use right, and
hard to use wrong.
The notation dlib uses to document preconditions and postconditions is described in
the introduction. All code that goes into dlib
must document itself using this notation. You should also separate the implementation
and specification of a component into two separate files as described in the introduction. This
way users aren't confused or distracted by implementation details when they look at the documentation.
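For a sense of what such a specification might look like, here is a small sketch in a requires/ensures comment style. The function clamp_value is hypothetical, and the comment layout is only an approximation from memory; the introduction remains the authoritative description of the notation. The plain assert is a stand-in for a DLIB_ASSERT style check.

```cpp
#include <cassert>

// ---- specification: what a user of the component reads ----

long clamp_value(long value, long low, long high);
/*!
    requires
        - low <= high
    ensures
        - returns value if low <= value <= high
        - returns low if value < low
        - returns high if value > high
!*/

// ---- implementation: kept apart from the specification ----

long clamp_value(long value, long low, long high)
{
    assert(low <= high);  // stand-in for a precondition check
    if (value < low)
        return low;
    if (value > high)
        return high;
    return value;
}
```

In this sketch both parts sit in one block only so that it compiles on its own. The separation the text asks for would put them in two files.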
Use spaces instead of tabs. This is just generally good advice but
it is especially important in dlib since everything is viewable
as pretty-printed HTML. Tabs show up as 8 characters in most browsers
and this results in the HTML version being difficult to read. So
don't use tabs. Additionally, please use 4 spaces for each indentation level.
Don't use capital letters in the names of variables, functions, or
classes. Use the _ character to separate words.
The reason dlib uses this style is because it is the style used by the
C++ standard library. But more importantly, dlib currently provides
an interface to users that has a consistent look and feel and it is
important to continue to do so.
Constants should usually be written in all uppercase letters,
although all lowercase is sometimes acceptable.
Don't use manual resource management. Use RAII
instead.
You should not be calling new and delete in your own code. You should instead
be using objects like the std::vector, scoped_ptr,
or any number of other objects that manage resources such as memory for you. If you want
an array use std::vector (or the checked std_vector_c).
If you want to make a lookup table use a map. If you want
a two dimensional array use matrix or
array2d.
These container objects are examples of what is called RAII (Resource Acquisition Is Initialization)
in C++. It is essentially a name for the fact that, in C++, you can have totally automated and
deterministic resource management by always associating resource acquisition with the construction
of an object and resource release with the destruction of an object. I say resource management
here rather than memory management
because, unlike the garbage collection found in languages such as Java, RAII can be used for more than memory management. For example, when
you use a mutex you first lock
it, do something, and then you need to remember to unlock it. The RAII way of doing this is
to use the auto_mutex which will lock a mutex and automatically
unlock it for you. Or suppose you have made a TCP connection
to another machine and you want to be certain the resources associated with that connection
are always released. You can easily accomplish this with RAII by using the scoped_ptr as
shown in this example program.
RAII is a trivial technique to use. All you have to do is not call new and delete and
you will never have another memory leak. Just use the appropriate container
instead. Finally, if you don't use RAII then your code is almost certainly not exception safe.
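The mutex case mentioned above can be sketched with standard library types. This assumes a C++11 compiler and uses std::lock_guard as a stand-in for the role auto_mutex plays in dlib. The names data_mutex, shared_data, and append_non_negative are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <vector>

std::mutex data_mutex;         // protects shared_data
std::vector<int> shared_data;  // the vector owns and frees its own memory

// Appends value to shared_data. std::lock_guard locks the mutex in its
// constructor and unlocks it in its destructor, so the mutex is released on
// every exit path, including the exception thrown below.
void append_non_negative(int value)
{
    std::lock_guard<std::mutex> lock(data_mutex);
    if (value < 0)
        throw std::invalid_argument("value must be non-negative");
    shared_data.push_back(value);
}
```

There is no unlock call to forget and no delete to mismatch. A later call to append_non_negative() still succeeds after an earlier call has thrown, which is the exception safety the paragraph above refers to.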
Don't use pointers
There are a number of reasons to not use pointers. First, if you are using pointers then
you are probably not using RAII. Second, pointers are ambiguous. When I see a pointer
I don't know if it is a pointer to a single item, a pointer to nothing, or
a pointer to an array of who knows how many things. On the other hand, when I see a
std::vector I know with certainty that I'm dealing with a kind of array. Or if I see a
reference to something then I know I'm dealing with exactly one instance of some object.
Most importantly, it is impossible to validate the state of a pointer. Consider two
functions:
double compute_sum_of_array_elements(const double* array, int array_size);
double compute_sum_of_array_elements(const std::vector<double>& array);
The first function is inherently unsafe. If the user accidentally passes in an invalid pointer
or sets the size argument incorrectly then their program may crash and this will turn into a
potentially hard to find bug. This is because there is absolutely nothing you can do inside
the first function to tell the difference between a valid pointer and size pair and an invalid
pointer and size pair. Nothing. The second function has none of these difficulties.
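To make the comparison concrete, here is one possible implementation of both declarations, written only as a sketch. The bodies are assumptions, since the text gives just the signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pointer version: nothing in here can tell whether array really points at
// array_size readable doubles, so a wrong argument is undefined behavior.
double compute_sum_of_array_elements(const double* array, int array_size)
{
    double sum = 0;
    for (int i = 0; i < array_size; ++i)
        sum += array[i];
    return sum;
}

// Container version: the vector carries its own size, so the loop bound is
// always right and there is no separate size argument to get wrong.
double compute_sum_of_array_elements(const std::vector<double>& array)
{
    double sum = 0;
    for (std::size_t i = 0; i < array.size(); ++i)
        sum += array[i];
    return sum;
}
```

The two loops look almost identical, which is the point: the safety comes entirely from the interface, not from extra work in the function body.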
If you absolutely need pointer semantics then you can usually use a smart pointer like
scoped_ptr or shared_ptr.
If that still isn't good enough for you and you really need to use a normal C style pointer
then isolate your pointers inside a class so that they are contained in a small area of the code.
However, in practice the container classes in dlib and the STL are more than sufficient in nearly
every case where pointers would otherwise be used.
Don't use #define for constants.
dlib is meant to be integrated into other people's projects. Because of this everything
in dlib is contained inside the dlib namespace to avoid naming conflicts with users' code.
#defines don't respect namespaces at all. For example, if you #define a constant called SIZE then it
will cause a conflict with any piece of code anywhere that contains the identifier SIZE.
This means that #define based constants must be avoided and constants should be created using the
const keyword instead.
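A short sketch of the difference, using two hypothetical namespaces (example_lib and user_code) in place of dlib and an application:

```cpp
#include <cassert>

namespace example_lib  // hypothetical library namespace
{
    // A typed constant that lives inside the namespace.
    const int SIZE = 200;
}

namespace user_code    // hypothetical application namespace
{
    // The user's own SIZE does not collide with example_lib::SIZE. Had the
    // library written "#define SIZE 200" instead, the preprocessor would turn
    // this line into "const int 200 = 16;" and it would fail to compile.
    const int SIZE = 16;
}
```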
Don't use stack based arrays.
A stack based array, or C style array, is an array declared like this:
int array[200];
Most of my criticisms of pointers also apply to stack based arrays. In particular,
if you are passing a stack based array to a function then that means you are probably
using functions similar to the unsafe compute_sum_of_array_elements() example above.
The only time it is OK to use this kind of array is when you use it for simple
tasks and you don't start passing pointers to the array to other parts of your code. You
should also use a constant to store the array size and use that constant in your loops
rather than hard coding the size in numerous places.
Even so, you should use a container class instead and preferably one with the ability to do range
checking such as the std_vector_c.
Consider the following two bits of code:
for (int i = 0; i < array_size; ++i)
my_c_array[i] = 4;
for (int i = 0; i < my_std_vector.size(); ++i)
my_std_vector[i] = 4;
The second loop clearly doesn't overflow the bounds of the my_std_vector. On the other
hand, just by looking at the code in the first loop, we cannot tell if it overflows
my_c_array. We have to assume that array_size is the appropriate constant but we could be wrong.
Buffer overflows are probably the most common kind of bug in C and C++ code. These bugs also
lead to serious exploitable security holes in software. So please try to avoid stack based arrays.
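As a sketch of what range checking buys you, the following uses std::vector::at() from the standard library, which throws std::out_of_range on a bad index. It is shown here as a standard library analogue of a checked container, not as a description of how std_vector_c behaves. The helper names fill_with and element_at are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sets every element to value. The loop bound comes from the container itself.
void fill_with(std::vector<int>& values, int value)
{
    for (std::size_t i = 0; i < values.size(); ++i)
        values[i] = value;
}

// Returns values[index], but with the index checked: at() throws
// std::out_of_range instead of silently reading past the end of the buffer.
int element_at(const std::vector<int>& values, std::size_t index)
{
    return values.at(index);
}
```

With a C style array the same out of range access would compile and run, and the error would surface later, if at all.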
Use exceptions, but don't abuse them.
Exceptions are one of the great features of modern programming languages. Some
people, however, consider that to be a contentious statement. But if you accept
the notion that a software library should be hard to use wrong then it
becomes difficult to reject exceptions.
Most of the complaints I hear about exceptions are actually complaints
about their misuse rather than objections to the basic idea.
So before I begin to defend the above
paragraph I would like to lay out more clearly when it is appropriate to
use exceptions and when it is not.
There are two basic questions you should ask yourself when deciding whether to
throw an exception in response to some event. The first is (1) "should this event
occur in the normal use of my library component?" The second question is (2) "if this event
were to occur, is it likely that the user will want to place the code for dealing
with the event near the invocations of my library component?"
If your answers to the above two questions are "no" then you should probably
throw an exception in response to the event. On the other hand, if you answer
"yes" to either of these questions then you should probably not throw an exception.
A good example of an event worth throwing exceptions for is running out of memory.
(1) It doesn't happen very often, and (2) when it does happen it is hardly ever the case that
you want to deal with the out of memory event right next to the place where you are
attempting to allocate memory.
Alternatively, an example of an event that shouldn't throw an exception comes to
us from the C++ I/O streams. This part of the standard library allows
you to read the contents of a file from disk. When you hit the end of file they
do not throw an exception. This is appropriate because (1) you usually want to
read a file in its entirety, so hitting EOF happens all the time. Additionally, (2)
when you hit EOF you usually want to break out of the loop you are in
and continue immediately into the next block of code.
Usually when someone tells me they don't like exceptions they give reasons like "they make
me put try/catch blocks all over the place and it makes the code hard to read." Or "it makes
it hard to understand the flow of a program with exceptions in it." Invariably they
have been working with bodies of software that disregard the above rules regarding questions
1 and 2. Indeed, when exceptions are used for flow control the results are horrifying. Using
exceptions for events that occur in the normal use of a software component, especially when
the events need to be dealt with near where they happen, results in a spaghetti-like mess
of throw statements and try/catch blocks. Clearly, exceptions should be used judiciously.
So please, take my advice regarding questions 1 and 2 to heart.
Now let's go back to my claim that exceptions are an important part of making
a library that is hard to use wrong. But first let's be honest about one thing:
many developers don't think very hard about error handling and they similarly aren't very
careful about checking function return codes. Moreover, even the most studious of
us can easily forget to add these checks. It is also easy to forget to add
appropriate exception catch blocks.
So what is so great about exceptions then? Well, let's imagine some error just occurred
and it caused an exception to be thrown. If you forgot to set up catch blocks to deal with
the error then your program will be aborted. Not exactly a great thing. But you will, however,
be able to easily find out what exception was thrown. Additionally, exceptions typically contain a
message telling you all about the error. Moreover,
any debugger worth its
salt will be able to show you a stack trace that lets you see exactly where the exception came from.
The exception forces you, the user, to
be aware of this potential error and to add a catch block to deal with it.
This is where the "hard to use wrong" comes from.
Now let's imagine that we are using return codes to communicate errors to the user and the
same error occurs. If you forgot to do all your return code checking then you will
simply be unaware of the error. Maybe your program will crash right away. But more likely, it
will continue to run for a while before crashing at some random place far away from the source
of the error. You and your debugger now get to spend a few hours of quality time
together trying to figure out what went wrong.
The above considerations are why I maintain that exceptions, when used properly, contribute to
the "hard to use wrong" factor of a library. There are also other reasons to use exceptions.
They free the user from needing to clutter up code with lots of return code checking. This makes
code easier to read and lets you focus more on the algorithm you are trying to implement and less
on the bookkeeping.
Finally, it is important to note that there is a place for return codes. When you answer "no"
to questions 1 and 2, I suggest using exceptions. However, if you answer "yes" to even one
of them then I would recommend pretty much anything other than throwing an exception. In this
case error codes are often an excellent idea.
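The two questions can be illustrated with a small sketch built on the I/O streams example from earlier. Everything here is hypothetical: sum_all_integers treats end of input as a normal event and reports it through the stream state, while sum_all_integers_strict assumes it is reading data the program wrote itself, so a malformed token is not part of normal use and is reported with an exception. The parse_error type derives from std::runtime_error only to keep the sketch self-contained.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Normal event, handled right at the call site: reaching the end of the input.
// No exception is involved; the loop simply ends when extraction fails.
long sum_all_integers(std::istream& in)
{
    long sum = 0;
    long value;
    while (in >> value)
        sum += value;
    return sum;
}

// Hypothetical exception type for this sketch.
struct parse_error : std::runtime_error
{
    explicit parse_error(const std::string& msg) : std::runtime_error(msg) {}
};

// Abnormal event that callers rarely handle next to the call: the text was
// assumed to be well formed, but reading stopped before the end was reached.
long sum_all_integers_strict(const std::string& text)
{
    std::istringstream in(text);
    const long sum = sum_all_integers(in);
    if (!in.eof())
        throw parse_error("non-numeric token in input");
    return sum;
}
```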
As an aside, it is also important that your exception classes inherit from
dlib::error to maintain consistency with the rest of the library.
Write portable code
- Don't complicate the build process
One of dlib's design goals is to not require any installation
process before it can be used. A user should be able to copy
the dlib folder into their project and have it just work.
In particular, using dlib in a project should not make it difficult to
compile the project from the command line. For example, all the
example programs provided with dlib can be compiled using a single
statement on the command line.
Similarly, the user should be able to check the dlib folder into whatever
version control system they use without running into any difficulties.
The user should then be able to check out copies of the code on any
of the dlib supported platforms and have their project build without
needing to mess with anything.
It is also important to know that dlib is mostly a header-only library.
This is primarily a result of the heavy use of C++ templates. Because of this,
in many cases, all that is needed to use the library is to
add dlib into a compiler's include search path. However, the most important
reason you need to know this is so you don't hassle me about providing a
"shared library" version of dlib. :)
This point deserves some explaining. When you write a piece of
software that links against a shared library you need two things. First,
you need the shared library object files and second you need the header files
for the library itself so you can #include them in your application. However,
since nearly all the code in dlib is in the header files there isn't
any point in distributing it as a shared library.
There are also a lot of technical problems with C++ shared libraries in general.
You can read about shared libraries on
this page
for more details. If you still think C++ template libraries like dlib should be built as
shared libraries then you should refer to the following documents which
have been submitted to the C++ standards committee:
N2316: Modules in C++,
N1496: Draft Proposal for
Dynamic Libraries in C++, and
N2425: Minimal Dynamic Library Support.
- Don't make assumptions about how objects are laid out in memory.
If you have been following the prohibition against messing around with
pointers then this won't even be an issue for you. Moreover, just about the only
time this should even come up is when you are casting blocks of
memory into other types or dumping the contents of memory to an I/O channel.
All of these things are highly non-portable so don't do them.
If you want a portable way to write the state of an object to an
I/O channel then I recommend you use the serialization
capability in dlib. If that doesn't suit your needs then do
something else, but whatever you do don't just dump the contents of memory.
Convert your data into some portable format and then output that.
As an example of something else you might do: suppose you have a bunch of integers
you want to write to disk. Assuming all your integers are positive numbers representable
using 32 or fewer bits you could store all your numbers in
dlib::uint32 variables and then convert them
into either big or little endian byte order and then write them to an output stream.
You could do this using code similar to the following:
dlib::byte_orderer bo;
...
bo.host_to_big(my_uint);
my_out_stream.write((char*)&my_uint, sizeof(my_uint));
...
There are three important things to understand about this process. First, you need
to pick variables that always have the same size on all platforms. This means you
can't use any of the built in C++ types like int, float, double, long, etc. All
of these types can have different sizes depending on your platform and even compiler settings.
So you need to use something like dlib::uint32 to obtain a type of a known size.
Second, you need to convert each thing you write out into either big or little endian byte order.
The reason for this is, again, portability. If you don't explicitly convert to one
of these byte orders then you end up writing data out using whatever byte order
is used by your current machine. If you do this then only machines that have the same
byte order as yours will be able to read in your data. If you use the dlib::byte_orderer
object this is easy. It is very type safe. In fact, you should have a hard time even getting
it to compile if you use it wrong.
The third thing you should understand is that you need to write out each of your
variables one at a time. You can't write out an entire struct in a
single ostream.write() statement because the compiler is allowed to put any
kind of padding it feels like between the fields in a struct.
You may be aware that compilers usually provide #pragma directives that allow you
to explicitly control this padding. However, if you want to submit code to dlib
you will not use this feature. Not all compilers support it in the same way and,
more importantly, not all CPU architectures are even capable of running code that
has had the padding messed with. This is because it can result in the CPU attempting
to perform what is called an "unaligned load" which many CPUs (like the SPARC) are
incapable of doing.
So in summary, convert your data into a known type with a fixed size, then convert
into a specific byte order (like big endian), then write out each variable individually.
Or you could just use serialize and not worry about all
this horrible stuff. :)
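For readers who want to see the three steps spelled out without any library help, here is a minimal sketch using only the standard library. It assumes a C++11 compiler and uses std::uint32_t as the fixed-size type in place of dlib::uint32. The function names are hypothetical. It sidesteps host byte order entirely by emitting the bytes one at a time with shifts.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes value as exactly four bytes, most significant byte first (big endian),
// regardless of the byte order of the machine running the code.
void write_uint32_big_endian(std::ostream& out, std::uint32_t value)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        out.put(static_cast<char>((value >> shift) & 0xFF));
}

// Reads back a value written by write_uint32_big_endian().
// Returns false if fewer than four bytes were available.
bool read_uint32_big_endian(std::istream& in, std::uint32_t& value)
{
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i)
    {
        const int byte = in.get();
        if (!in)
            return false;
        result = (result << 8) | static_cast<std::uint32_t>(byte & 0xFF);
    }
    value = result;
    return true;
}
```

A struct would be written by calling write_uint32_big_endian() once per field, which avoids any dependence on padding.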
- All code that calls functions that aren't in dlib or the C++
standard library must be isolated inside the API wrappers.
If you want to contribute code to dlib which needs to use something that isn't
in the C++ standard then we need to introduce a new library component
in the API wrappers section. The new component would
provide whatever functionality you need. This new component would have
to provide at least POSIX and win32 implementations.
It is also worth pointing out that simple wrappers around operating system
specific calls are usually a bad solution. This is because there are
invariably subtle, if not huge, differences between what is available on different
operating systems.
So being truly portable takes a lot of work. It involves reading everything
you can find about all the APIs needed to implement the feature on each target platform.
In many cases there will be important details that are undocumented and you will
only be able to find out about them by searching the internet for other developers
complaining about bugs in API functions X, Y, and Z. All this stuff needs to be abstracted
away to put a portable and simple interface in front of it. So this is a task
that shouldn't be taken lightly.
Library components should have regression tests
dlib has a regression test suite located in
the dlib/test folder. Whenever possible, library components should have tests
associated with them. GUI components get a pass since it isn't very easy to set up
automatic tests for them but pretty much everything else should have some sort
of test.
You must use the Boost Software License
Having the library use more than one open source license is confusing
so I ask that any code contributions be licensed under the Boost Software
License.