
Given a text input, find up to num_suggestions possible occupation categories.


  suggestion_type = "auxco-1.2.x",
  num_suggestions = 5,
  suggestion_type_options = list(),
  aggregate_score_threshold = 0.02,
  item_score_threshold = 0,
  distinctions = TRUE,
  steps = list(simbased_wordwise = list(algorithm = algo_similarity_based_reasoning,
    parameters = list(sim_name = "wordwise")), simbased_substring = list(algorithm =
    algo_similarity_based_reasoning, parameters = list(sim_name = "substring"))),
  include_general_id = FALSE



The raw text input from the user.


Which type of suggestion to use / provide. Possible options are "auxco-1.2.x" and "kldb-2010".


The maximum number of suggestions to show. This is an upper bound and fewer suggestions may be returned. Defaults to 5.


A list with options for generating suggestions. Supported options:

  • datasets: Pass specific datasets to be used when adding information to predictions, e.g. to use a specific version of the KldB or Auxco. Supported datasets are: "auxco-1.2.x", "kldb-2010". By default, the datasets bundled with this package are used.


A single value or named list of thresholds between 0 and 1. If it is a list, each entry should correspond to one of the steps. If it is a single value, it will apply to all steps. Results from a step will only be returned if the sum of their scores is equal to or greater than the specified threshold. With an aggregate_score_threshold of 0, results will always be returned (if there are any).
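As a rough illustration of the difference between a single value and a named list, the sketch below uses two hypothetical helpers, resolve_threshold() and passes_threshold(), which are not part of the package. The step names match the default steps; the threshold values in per_step are only illustrative.

```r
# Hypothetical helper (not part of the package): look up the aggregate
# score threshold that applies to a given step.
resolve_threshold <- function(aggregate_score_threshold, step_name) {
  if (is.list(aggregate_score_threshold)) {
    # Named list: one entry per step
    aggregate_score_threshold[[step_name]]
  } else {
    # Single value: applies to every step
    aggregate_score_threshold
  }
}

# Hypothetical helper: a step's results are kept only if their summed
# score is equal to or greater than the threshold.
passes_threshold <- function(scores, threshold) {
  sum(scores) >= threshold
}

# Illustrative per-step thresholds (values chosen for this example only)
per_step <- list(simbased_wordwise = 0.535, simbased_substring = 0.02)

resolve_threshold(0.02, "simbased_wordwise")      # single value for all steps
resolve_threshold(per_step, "simbased_wordwise")  # step-specific value
passes_threshold(c(0.3, 0.1), resolve_threshold(per_step, "simbased_wordwise"))
```

With a named list, a stricter threshold can be set for an early step so that weak matches fall through to later steps.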


A threshold between 0 and 1 (usually very small, default 0). Results from any step will only be returned if their score is greater than the specified threshold. This allows highly implausible suggestions to be removed.


Whether or not to add additional distinctions to similar occupational categories to the source code. Defaults to TRUE.


A list with the algorithms to use and their parameters. Each entry of the list should contain a nested list with two entries: algorithm (the algorithm's function itself) and parameters (the parameters to pass onto the algorithm). Each algorithm will also always have access to a default set of three parameters:

  • text_processed: The input text after preprocessing

  • suggestion_type: Which type of suggestion to output

  • num_suggestions: How many suggestions shall be returned

These parameters must not be specified manually and will be provided automatically instead. Defaults to:

  list(
    # try similarity "one word at most 1 letter different" first
    simbased_wordwise = list(
      algorithm = algo_similarity_based_reasoning,
      parameters = list(
        sim_name = "wordwise",
        min_aggregate_prob = 0.535
      )
    ),
    # since everything else failed, try "substring" similarity
    simbased_substring = list(
      algorithm = algo_similarity_based_reasoning,
      parameters = list(
        sim_name = "substring",
        min_aggregate_prob = 0.02
      )
    )
  )


Whether a general column called "id" should always be returned. This will automatically contain the appropriate id for the chosen suggestion_type, i.e. for "auxco-1.2.x" it will contain the same data as the column "auxco_id".


A data.table with suggestions or NULL if no suggestions were found.


The procedure implemented here is, roughly speaking, as follows:

  1. Predict categories from KldB 2010, including their scores. The first algorithm mentioned in steps is used (default: algo_similarity_based_reasoning()).

  2. Convert the predicted KldB 2010 categories to suggestion_type (default: auxco-1.2.x; this is an n:m mapping and scores are mapped accordingly). See internal function convert_suggestions() for details.

  3. Remove predicted categories if their score is below item_score_threshold and only keep the num_suggestions top-ranked suggestions.

  4. Start anew, trying the next algorithm in steps, if the top-ranked suggestions have a low chance of being correct. (Technically, this happens if the summed score of the num_suggestions top-ranked suggestions is below aggregate_score_threshold.)

  5. If suggestion_type == "auxco-1.2.x" and distinctions == TRUE, insert additional and (highly) similar categories or replace existing ones. See internal function add_distinctions_auxco(). Reorder and keep only the num_suggestions top-ranked suggestions. Auxco categories which were added during this step can be identified by their scores: the score equals 0.05 for categories with high similarity and 0.005 for categories with medium similarity.
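The fallback logic in steps 1, 3 and 4 can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: suggest_sketch(), algo_strict() and algo_lenient() are hypothetical names, the toy algorithms return a plain data.frame with made-up scores, and the conversion (step 2) and distinctions (step 5) are left out.

```r
# Toy stand-ins for real algorithms (assumptions for illustration only).
# Each returns a data.frame with a category id and a score.
algo_strict <- function(text_processed, num_suggestions, ...) {
  data.frame(id = c("A", "B"), score = c(0.010, 0.005))
}
algo_lenient <- function(text_processed, num_suggestions, ...) {
  data.frame(id = c("C", "D", "E"), score = c(0.40, 0.20, 0.00))
}

suggest_sketch <- function(text, steps, num_suggestions = 5,
                           aggregate_score_threshold = 0.02,
                           item_score_threshold = 0) {
  for (step in steps) {
    # Step 1: predict categories with this step's algorithm; the default
    # parameters are supplied automatically, the rest come from the step
    args <- c(
      list(text_processed = text, num_suggestions = num_suggestions),
      step$parameters
    )
    res <- do.call(step$algorithm, args)

    # Step 3: drop implausible items, then keep the top-ranked suggestions
    res <- res[res$score > item_score_threshold, , drop = FALSE]
    res <- res[order(res$score, decreasing = TRUE), , drop = FALSE]
    res <- head(res, num_suggestions)

    # Step 4: accept the result if the summed score is high enough,
    # otherwise start anew with the next step
    if (nrow(res) > 0 && sum(res$score) >= aggregate_score_threshold) {
      return(res)
    }
  }
  NULL # no step produced acceptable suggestions
}

steps <- list(
  first = list(algorithm = algo_strict, parameters = list()),
  second = list(algorithm = algo_lenient, parameters = list())
)
result <- suggest_sketch("baker", steps)
```

Here the first toy algorithm's scores sum to 0.015, which is below the default aggregate threshold of 0.02, so the sketch falls through to the second algorithm and returns its two non-zero suggestions.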


