Sorting
=======
A very common need is sorting data. Therefore libUCW contains a few
routines to accomplish that task. They are much more universal than
qsort(), since they allow you to sort structures indexed by a macro,
sort data externally if they do not fit into memory, merge data with
the same keys, and sort data of variable length.
All routines described below are <<generic:,generic algorithms>>.
- <<array-simple,Simple array sorting>>
* <<mandatory-simple,Mandatory macros>>
* <<optional-simple,Optional macros>>
* <<example-simple,Example>>
- <<array,Huge array sorting>>
* <<mandatory-array,Mandatory macros>>
* <<optional-array,Optional macros>>
- <<external,External sorting>>
* <<basic-external,Basic macros>>
* <<callback-external,Callbacks>>
* <<integer-external,Integer sorting>>
* <<hash-external,Hashing>>
* <<merge-external,Merging>>
* <<input-external,Input>>
* <<output-external,Output>>
* <<other-external,Other switches>>
* <<function-external,Generated function>>
[[array-simple]]
Simple array sorting
--------------------
If you want to sort some data in memory and do not need fine control
over how it is done, use the routine defined in
`sorter/array-simple.h`. It is an optimised hybrid
quick-sort/insert-sort algorithm (quick-sort splits the input into
small parts, each of which is then sorted by insert-sort). It is
more than 2 times faster than stdlib's qsort(), mostly because of
inlining.
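The hybrid scheme described above can be sketched in plain C. This is only an illustration of the technique, not the code from `sorter/array-simple.h`: the names `sketch_hybrid_sort` and `SKETCH_THRESHOLD` are made up for this example, and the keys are fixed to `int`.

```c
#include <stddef.h>

#define SKETCH_THRESHOLD 8  /* hypothetical cut-off, mirroring the documented default */

/* Insertion sort on a[lo..hi] (inclusive bounds). */
static void sketch_insert_sort(int *a, ptrdiff_t lo, ptrdiff_t hi)
{
  for (ptrdiff_t i = lo + 1; i <= hi; i++) {
    int x = a[i];
    ptrdiff_t j = i - 1;
    while (j >= lo && a[j] > x) {
      a[j + 1] = a[j];
      j--;
    }
    a[j + 1] = x;
  }
}

/* Quick-sort that leaves ranges shorter than the threshold to insertion sort. */
static void sketch_hybrid_sort_range(int *a, ptrdiff_t lo, ptrdiff_t hi)
{
  while (hi - lo + 1 >= SKETCH_THRESHOLD) {
    int pivot = a[lo + (hi - lo) / 2];
    ptrdiff_t i = lo, j = hi;
    /* Hoare-style partition around the value of the middle element. */
    while (i <= j) {
      while (a[i] < pivot) i++;
      while (a[j] > pivot) j--;
      if (i <= j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
        i++; j--;
      }
    }
    /* Recurse into the smaller part, keep looping on the larger one. */
    if (j - lo < hi - i) {
      sketch_hybrid_sort_range(a, lo, j);
      lo = i;
    } else {
      sketch_hybrid_sort_range(a, i, hi);
      hi = j;
    }
  }
  sketch_insert_sort(a, lo, hi);
}

void sketch_hybrid_sort(int *a, size_t n)
{
  if (n > 1)
    sketch_hybrid_sort_range(a, 0, (ptrdiff_t) n - 1);
}
```

Because the comparison is written directly into the loops, the compiler can inline it, which is the effect the paragraph above credits for the speed-up over qsort().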
You need to define a few macros and include the header. You get a
sorting function in return. It will be called
<<fun__GENERIC_LINK_|ASORT_PREFIX|sort|,`ASORT_PREFIX(sort)`>>.
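To show how a prefix macro can turn one piece of source into several differently named functions, here is a minimal self-contained sketch. It is not how libUCW is implemented (libUCW reads the macros when the header is included); the generator `SKETCH_DEFINE_SORT` and the comparison macros are hypothetical and exist only in this example.

```c
#include <stddef.h>

/* Hypothetical one-shot generator: expands to a function named <prefix>_sort. */
#define SKETCH_DEFINE_SORT(prefix, key_type, lt)                  \
  void prefix##_sort(key_type *array, size_t n)                   \
  {                                                               \
    for (size_t i = 1; i < n; i++) {                              \
      key_type x = array[i];                                      \
      size_t j = i;                                               \
      while (j > 0 && lt(x, array[j - 1])) {                      \
        array[j] = array[j - 1];                                  \
        j--;                                                      \
      }                                                           \
      array[j] = x;                                               \
    }                                                             \
  }

#define SKETCH_INT_LT(x, y) ((x) < (y))
#define SKETCH_DESC_LT(x, y) ((x) > (y))

SKETCH_DEFINE_SORT(ints, int, SKETCH_INT_LT)        /* generates ints_sort() */
SKETCH_DEFINE_SORT(ints_desc, int, SKETCH_DESC_LT)  /* generates ints_desc_sort() */
```

Each expansion yields an independent function, so the same program can hold an ascending and a descending sorter side by side, just as different prefixes keep several generated sorters apart.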
[[mandatory-simple]]
Mandatory macros
~~~~~~~~~~~~~~~~
- `ASORT_PREFIX(name)` -- The identifier generating macro.
- `ASORT_KEY_TYPE` -- Data type of a single array entry key.
[[optional-simple]]
Optional macros
~~~~~~~~~~~~~~~
- `ASORT_ELT(i)` -- Indexing macro. Returns the key of the
corresponding entry. If not provided, usual array with sequential
indexing is assumed.
- `ASORT_LT(x,y)` -- Comparing macro. If not provided, compares by the
`<` operator.
- `ASORT_SWAP(i,j)` -- Swap elements with indices `i` and `j`. If not
provided, it assumes `ASORT_ELT` is l-value and it just swaps keys.
- `ASORT_THRESHOLD` -- Sequences of at least this many elements are
sorted by quick-sort, shorter ones are sorted by insert-sort. Defaults to
`8` (result of experimentation).
- `ASORT_EXTRA_ARGS` -- Pass some extra arguments to the function.
They are visible from all the macros. Must start with a comma.
!!ucw/sorter/array-simple.h ASORT_PREFIX
[[example-simple]]
Example
~~~~~~~
Let's sort an array of integers, in the usual way.
  #define ASORT_PREFIX(X) intarr_##X
  #define ASORT_KEY_TYPE int
  #include <ucw/sorter/array-simple.h>
This generates an intarr_sort(int *array, uint array_size) function that
can be used the obvious way.
A more complicated example could be sorting a structure, where items
with odd indices are stored in one array, even in another. Each item
could be a structure containing a string and an integer. We would like
to sort them by the strings.
  struct elem {
    char *string;
    int integer;
  };
  #include <string.h> // Because of strcmp
  #define ASORT_PREFIX(X) complicated_##X
  #define ASORT_KEY_TYPE struct elem
  // Odd indices live in odd_array, even indices in even_array
  #define ASORT_ELT(i) (((i) % 2 ? odd_array : even_array)[(i) / 2])
  #define ASORT_LT(x, y) (strcmp((x).string, (y).string) < 0)
  #define ASORT_EXTRA_ARGS , struct elem *odd_array, struct elem *even_array
  #include <ucw/sorter/array-simple.h>
Now we have a complicated_sort(uint array_size, struct elem *odd_array,
struct elem *even_array) function to perform our sorting.
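The index mapping used in this example can be tried out without libUCW. The sketch below is a stand-alone illustration under stated assumptions: `sketch_split_sort`, `SKETCH_ELT` and `SKETCH_LT` are hypothetical names, and a plain insertion sort stands in for the generated sorter. It shows that an indexing macro yielding an l-value is enough to sort a sequence spread over two arrays.

```c
#include <string.h>
#include <stddef.h>

struct sketch_elem {
  const char *string;
  int integer;
};

/* Logical item i lives in odd_array when i is odd, otherwise in even_array. */
#define SKETCH_ELT(i) (((i) % 2 ? odd_array : even_array)[(i) / 2])
#define SKETCH_LT(x, y) (strcmp((x).string, (y).string) < 0)

/* Plain insertion sort over the logical sequence of n items. */
void sketch_split_sort(size_t n, struct sketch_elem *odd_array, struct sketch_elem *even_array)
{
  for (size_t i = 1; i < n; i++) {
    struct sketch_elem x = SKETCH_ELT(i);
    size_t j = i;
    while (j > 0 && SKETCH_LT(x, SKETCH_ELT(j - 1))) {
      SKETCH_ELT(j) = SKETCH_ELT(j - 1);
      j--;
    }
    SKETCH_ELT(j) = x;
  }
}
```

With five items, logical positions 0, 2 and 4 map to `even_array[0..2]` and positions 1 and 3 map to `odd_array[0..1]`; after sorting, reading the positions in order gives the strings in ascending order.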
[[array]]
Huge array sorting
------------------
This one is very similar to the simple array sorter, but it is
optimised for huge arrays. It is used mostly by the
<<external,external sorter>> machinery described below, but you can
use it directly.
It is in the `sorter/array.h` header.
It differs in a few details:
- It supports only contiguous arrays; no indexing macro can be
provided.
- It is able to sort in parallel on SMP systems. It assumes all
callbacks you provide are thread-safe.
- If you provide a monotone hash function (if `hash(x) < hash(y)`, then
`x < y`; keys with `hash(x) == hash(y)` may still differ), it
will use it to gain some more speed by radix-sort.
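As an illustration of what a monotone hash can look like, the following sketch packs the first four bytes of a string into a 32-bit number. It assumes the keys are NUL-terminated strings ordered as by strcmp(); the function name `sketch_prefix_hash` is hypothetical and not part of libUCW.

```c
#include <stdint.h>
#include <string.h>

/* Pack the first four bytes of a NUL-terminated string, most significant first.
 * Shorter strings are padded with zero bytes. */
uint32_t sketch_prefix_hash(const char *s)
{
  uint32_t h = 0;
  size_t i = 0;
  for (; i < 4 && s[i]; i++)
    h = (h << 8) | (unsigned char) s[i];
  for (; i < 4; i++)
    h <<= 8;
  return h;
}
```

A smaller hash always means a smaller string, while strings sharing their first four bytes get the same hash and still have to be ordered by the comparison macro.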
[[mandatory-array]]
Mandatory macros
~~~~~~~~~~~~~~~~
- `ASORT_PREFIX(x)` -- The identifier generating macro.
- `ASORT_KEY_TYPE` -- Type of elements in the array.
[[optional-array]]
Optional macros
~~~~~~~~~~~~~~~
- `ASORT_LT(x,y)` -- Comparing macro. Uses the `<` operator if not
provided.
- `ASORT_HASH(x)` -- A monotone hash function (or macro). Should
return `uint`.
- `ASORT_LONG_HASH(x)` -- Like `ASORT_HASH(x)`, but returns 64-bit
number instead of 32-bit.
- `ASORT_THRESHOLD` -- How small should a chunk of data be to be sorted
by insert-sort? Defaults to `8` elements.
- `ASORT_RADIX_BITS` -- How many bits of the hash function should be
used at once for radix-sort? The default is guessed from your
architecture.
!!ucw/sorter/array.h ASORT_PREFIX
[[external]]
External sorting
----------------
If you have too much data to fit into memory, you need to employ
external sorting. This external sorter operates on
<<fastbuf:,fastbufs>> containing sequences of items. Each item
consists of a key, optionally followed by data. Both the keys and data
may be of variable length, but the keys must be represented by
fixed-size type in memory. The length of data must be computable from
the key. Data are just copied verbatim, unless you use the merging
mode, in which data with the same keys get merged together.
All callbacks must be thread safe.
The sorter resides in the `sorter/sorter.h` header file.
[[basic-external]]
Basic macros
~~~~~~~~~~~~
You need to provide some basic macros. Some of them are optional.
- `SORT_PREFIX(x)` -- Identifier generating macro. This one is
mandatory.
- `SORT_KEY` -- Data structure holding the key of item in memory. The
representation on disk may be different. Either this one or
`SORT_KEY_REGULAR` must be provided.
- `SORT_KEY_REGULAR` -- You may use this instead of `SORT_KEY`, when
the keys have the same representation both in memory and on disk.
Then the sorter uses <<fastbuf:bread()>> and <<fastbuf:bwrite()>> to
load and store them. It also assumes the keys are not very long.
- `SORT_KEY_SIZE(key)` -- Returns the real size of the key. The sorter
can use this to save space and truncate the key to the given number
of bytes, when the keys have variable lengths. If the keys have
fixed sizes, there is no need for this macro.
- `SORT_DATA_SIZE(key)` -- Returns the amount of data following this
key. If you do not provide this one, the sorter assumes there are
only keys and no data.
[[callback-external]]
Callbacks
~~~~~~~~~
Furthermore, you need to provide these callback functions (make sure
they are thread safe):
- `int SORT_PREFIX(compare)(SORT_KEY *a, SORT_KEY *b)` -- Comparing
function. It should act like strcmp(). Mandatory unless provided by
<<integer-external,integer sorting>>.
- `int SORT_PREFIX(read_key)(struct fastbuf *f, SORT_KEY *k)` --
Should read a key from the provided <<fastbuf:,fastbuf>> @f and
store it into @k. Returns nonzero on success and zero when `EOF` is
reached. Mandatory unless `SORT_KEY_REGULAR` is defined.
- `void SORT_PREFIX(write_key)(struct fastbuf *f, SORT_KEY *k)` --
Should store key @k into @f. Mandatory unless `SORT_KEY_REGULAR` is
defined.
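A comparison callback with strcmp()-like results might look like the sketch below. The key layout `struct sketch_key` and the name `sketch_compare` are assumptions made for this example; in real use the type is whatever `SORT_KEY` names and the function is called `SORT_PREFIX(compare)`.

```c
#include <stdint.h>

/* Hypothetical key type; the real one is whatever SORT_KEY names. */
struct sketch_key {
  uint32_t site;
  uint32_t page;
};

/* Three-way comparison: negative, zero or positive, like strcmp().
 * Written with comparisons rather than subtraction, so that it cannot overflow. */
int sketch_compare(struct sketch_key *a, struct sketch_key *b)
{
  if (a->site != b->site)
    return (a->site > b->site) - (a->site < b->site);
  return (a->page > b->page) - (a->page < b->page);
}
```

The function touches only its arguments and keeps no state, which is one simple way to meet the thread-safety requirement.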
[[integer-external]]
Integer sorting
~~~~~~~~~~~~~~~
If you sort by an integer value (either computed or available from
the key), you can use this to save yourself some functions. It also
activates the <<hash-external,hashing>> automatically.
- `SORT_INT(key)` -- This macro returns the integer to sort by. When
you provide it, the compare function is automatically provided for
you and the sorting function gets another parameter specifying the
range of the integers. The better the range fits, the faster the
sorting runs.
- `SORT_INT64(key)` -- The same, but with 64-bit integers.
[[hash-external]]
Hashing
~~~~~~~
If you have a monotone hash function for your keys, you may speed the
sorting up by providing it. A monotone hash function must satisfy: if
`hash(x) < hash(y)`, then `x < y`. Its values should be approximately
uniformly distributed.
When you want to use it, define `SORT_HASH_BITS` and set it to the
number of significant bits the hashing function provides. Then provide
a callback function `uint SORT_PREFIX(hash)(SORT_KEY *key)`.
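The relation between the number of significant bits and the hash callback can be sketched as follows. The key type, the constant `SKETCH_HASH_BITS` and the function name are hypothetical, and the sketch assumes the keys are ordered by a 64-bit `id` whose values are spread over the whole 64-bit range.

```c
#include <stdint.h>

#define SKETCH_HASH_BITS 16  /* stands in for the value given to SORT_HASH_BITS */

struct sketch_id_key {
  uint64_t id;
};

/* Keep the top SKETCH_HASH_BITS bits of the 64-bit id. */
unsigned int sketch_hash(struct sketch_id_key *key)
{
  return (unsigned int) (key->id >> (64 - SKETCH_HASH_BITS));
}
```

Dropping low-order bits keeps the hash monotone, but it is close to uniform only if the ids themselves are; keys clustered in a narrow range would all land in a few hash values.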
[[merge-external]]
Merging items with identical keys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The sorter is able to merge items with the same keys (the compare
function returns `0` for them). To use it, define `SORT_UNIFY` macro
and provide these functions:
- `void SORT_PREFIX(write_merged)(struct fastbuf \*dest, SORT_KEY
\*\*keys, void \*\*data, uint n, void *buf)`
-- This function takes @n records in memory and writes a single
record into the @dest <<fastbuf:,fastbuf>>. The @keys and @data are
just the records. The @buf parameter points to a workspace memory.
It is guaranteed to hold at least the sum of `SORT_UNIFY_WORKSPACE()`
macro over all the keys. The function is allowed to modify all its
parameters.
- `void SORT_PREFIX(copy_merged)(SORT_KEY \*\*keys, struct fastbuf
\*\*data, uint n, struct fastbuf \*dest)`
-- This one is similar to the above one, but the data are still in
the <<fastbuf:,fastbufs>> @data and no workspace is provided. This
is only used when `SORT_DATA_SIZE` or `SORT_UNIFY_WORKSPACE` is
provided.
- `SORT_UNIFY_WORKSPACE(key)` -- Returns the amount of workspace
needed when merging this record. Defaults to `0`.
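What merging means for the data can be illustrated in memory, without fastbufs. The sketch below is not one of the callbacks above; it assumes a hypothetical record type with an integer key and a counter, and collapses each run of equal keys in an already sorted array into one record.

```c
#include <stddef.h>

struct sketch_record {
  int key;
  unsigned count;
};

/* Collapse runs of equal keys in a sorted array, summing their counts.
 * Returns the number of records left at the front of the array. */
size_t sketch_unify(struct sketch_record *r, size_t n)
{
  if (n == 0)
    return 0;
  size_t out = 0;
  for (size_t i = 1; i < n; i++) {
    if (r[i].key == r[out].key)
      r[out].count += r[i].count;
    else
      r[++out] = r[i];
  }
  return out + 1;
}
```

A `write_merged` callback plays the role of the summing step here: it receives all records of one run at once and decides what single record to write.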
[[input-external]]
Specifying input
~~~~~~~~~~~~~~~~
To tell the sorter where the input is, define one of these
macros:
- `SORT_INPUT_FILE` -- The function takes a filename.
- `SORT_INPUT_FB` -- The input is a seekable fastbuf stream.
- `SORT_INPUT_PIPE` -- The input is a non-seekable fastbuf stream.
- `SORT_INPUT_PRESORT` -- The input is a custom presorter. In this
case, you need to write a presorting function `int
SORT_PREFIX(presort)(struct fastbuf *dest, void *buf, size_t
bufsize)`. The function gets a buffer @buf of size @bufsize to
presort in and is supposed to write a presorted bunch of data into the
@dest buffer. It should return `1` on success or `0` on `EOF`
(everything has already been written and there is no more data).
With this input type, you can safely pass NULL as the input parameter.
The function may be used to generate the data on the fly. The function
does not have to be thread safe (it can access global variables).
If you define `SORT_DELETE_INPUT` and it evaluates to true (nonzero),
the input files are deleted as soon as possible.
[[output-external]]
Specifying output
~~~~~~~~~~~~~~~~~
You can configure the output in a similar way. Define one of these macros:
- `SORT_OUTPUT_FILE` -- The function takes a filename.
- `SORT_OUTPUT_FB` -- The function should be provided with NULL and
the fastbuf with data is returned.
- `SORT_THIS_FB` -- A fastbuf is provided to the function and it
writes into it. It can already contain some data.
[[other-external]]
Other switches
~~~~~~~~~~~~~~
You may define the `SORT_UNIQUE` macro if all keys are distinct. It is
checked in debug mode.
[[function-external]]
The generated function
~~~~~~~~~~~~~~~~~~~~~~
A `SORT_PREFIX(sort)()` function is generated after you include the
`sorter/sorter.h` header. It has up to three parameters:
- Input. It is either a string (a filename) if you use
`SORT_INPUT_FILE` or a fastbuf (otherwise). It should be set to NULL
if you use the `SORT_INPUT_PRESORT` input.
- Output. It is either a string (a filename) if you defined the
`SORT_OUTPUT_FILE` or a fastbuf. It must be NULL if you defined
`SORT_OUTPUT_FB`.
- Integer range. The maximum value of integers that are used in the
<<integer-external,integer sorting>>. This parameter is here only
if you defined `SORT_INT` or `SORT_INT64`.
The function returns a fastbuf you can read the data from.