Preprocess a string, removing special characters and handling abbreviations.
Source: R/helper_functions.R
preprocess_string.Rd
Replaces some common characters and character sequences (e.g., Ä, Ü, "DIPL.-ING.") with their uppercase equivalents, and removes punctuation, empty spaces and the word "Diplom".
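To make the kind of normalization described here concrete, the following is a minimal base-R sketch. It is not the package's implementation: `sketch_preprocess` is a hypothetical name, the umlaut-to-ASCII mapping (for example Ü to UE) is an assumption about what "uppercase equivalents" means, and the handling of abbreviations and of the word "Diplom" is left out.

```r
# Hypothetical helper for illustration only; NOT the package's preprocess_string().
sketch_preprocess <- function(verbatim) {
  x <- enc2utf8(verbatim)

  # Assumed mapping: German umlauts and sharp s to uppercase ASCII digraphs.
  from <- c("\u00e4", "\u00c4", "\u00f6", "\u00d6", "\u00fc", "\u00dc", "\u00df")
  to   <- c("AE", "AE", "OE", "OE", "UE", "UE", "SS")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }

  # Uppercase the remaining letters.
  x <- toupper(x)

  # Replace punctuation with spaces, then collapse and trim whitespace.
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

sketch_preprocess("Verkauf von B\u00fcchern, Schreibwaren")
```

Replacing the special characters before calling `toupper()` keeps the sketch independent of how the current locale uppercases non-ASCII letters.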
Arguments
- verbatim
The character vector to process.
- lang
The language the text is in. Currently only German is supported. Defaults to "de" (German).
Details
charToRaw() helps to find UTF-8 characters.
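As a small illustration of this note, the sketch below uses base R to show the bytes behind a string containing an umlaut. The variable names are illustrative, and `enc2utf8()` is added here only so the result does not depend on the session's native encoding.

```r
# Inspect the raw bytes of a string that contains an umlaut.
x <- enc2utf8("B\u00fccher")
bytes <- charToRaw(x)
print(bytes)

# Non-ASCII characters show up as bytes >= 0x80 in UTF-8;
# here the two bytes c3 bc encode the single character "ü".
non_ascii <- bytes[as.integer(bytes) >= 128L]
print(non_ascii)
```

Any multi-byte sequence found this way can be written as a `\u` escape in source code, as done in the examples on this page.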
Examples
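# Limit data.table to a single thread
data.table::setDTthreads(1)
# Not run: the call below is wrapped in if (FALSE)
if (FALSE) {
  # Non-ASCII characters are written as \u escapes
  preprocess_string(c(
    "Verkauf von B\u00fcchern, Schreibwaren",
    "Fach\u00e4rztin f\u00fcr Kinder- und Jugendmedizin im \u00f6ffentlichen Gesundheitswesen",
    "Industriemechaniker",
    "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"
  ))
}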