Replaces some common characters and character sequences (e.g., Ä, Ü, "DIPL.-ING.") with their uppercase equivalents, and removes punctuation, empty spaces, and the word "Diplom".

Usage

preprocess_string(verbatim, lang = "de")

Arguments

verbatim

The character vector to process.

lang

The language the text is in. Currently only German is supported. Defaults to "de" (German).

Value

The same character vector after processing.

Details

charToRaw() can help identify UTF-8 characters by showing the raw bytes of a string.
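As a minimal illustration of that tip, the base R sketch below inspects the bytes of a single umlaut. It does not call preprocess_string() and makes no claim about how the function works internally; the variable names (x, bytes, code_point) are chosen here for illustration only.

```r
# "\u00fc" is the escaped form of the German umlaut "ü" (code point U+00FC).
x <- "\u00fc"

# Inspect the raw bytes; in UTF-8, "ü" is the two-byte sequence c3 bc.
bytes <- charToRaw(x)
print(bytes)

# utf8ToInt() returns the integer code point (252 for U+00FC),
# and intToUtf8() converts it back to the character.
code_point <- utf8ToInt(x)
round_trip <- intToUtf8(code_point)
print(round_trip == x)
```

Comparing these bytes against the output for a string read from a file can show whether an unexpected character is a different code point or just a different encoding.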

Examples

data.table::setDTthreads(1)

if (FALSE) {
# Not run. Umlauts are written as \u escapes (e.g., \u00fc is "ü")
# so that the source stays ASCII-only.
preprocess_string(c(
  "Verkauf von B\u00fcchern, Schreibwaren",
  "Fach\u00e4rztin f\u00fcr Kinder- und Jugendmedizin im \u00f6ffentlichen Gesundheitswesen",
  "Industriemechaniker",
  "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"
))
}