R - Cleaning and editing a column
I've been trying to figure out how to clean and edit a column in my data set.
The dataset I'm using is supposed to cover the city of San Francisco. A column in the data set called "city" contains multiple different spellings of San Francisco, as well as other cities. Here is what it looks like:
    table(sf$city)

                                           brentwood                      ca
                       30401                      18                     370
                   daly city             foster city                 hayward
                           0                       0                       0
                      novato                 oakland                 oakland
                           0                      40                       0
                         s f                    s.f.                 s.f. ca
                           0                   31428                      12
                   san bruno           san francicso          san franciisco
                           0                     221                      54
               san francisco           san francisco           san francisco
                          20                     284                       0
               san francisco           san francisco        san francisco ca
                       78050                   16603                       6
              san francisco,       san francisco, ca       san francisco, ca
                          12                       4                      72
     san francisco, ca 94132          san franciscvo           san francsico
                           0                       0                       2
              san franicisco          sand francisco                      sf
                          41                      30                      17
                          sf                      sf                 sf , ca
                         214                   81226                       1
                 sf ca 94133                  sf, ca            sf, ca 94110
                           0                       9                      38
                sf, ca 94115                     sf.                     sf`
                           4                    1656                      31
           so. san francisco                 so.s.f.
                           0                       6
What I am trying to do is change sf$city so that it only has "san francisco". All of the data in sf$city should be placed under one city, San Francisco, so that when I type table(sf$city), it shows only San Francisco.
Could I subset? Something like:

    sf$city = subset(sf, city == "s.f." & "s.f. ca" & "san francicso" & ...

and subset the city variables I want? Or would that distort and mess up the data?
Should I try regular expressions with agrep and grep?
Example data:

    d <- c("brentwood", "ca", "daly city", "foster city", "hayward", "novato",
           "oakland", "oakland", "s f", "s.f.", "s.f. ca", "san bruno",
           "san francicso", "san franciisco", "san francisco", "san francisco",
           "san francisco", "san francisco", "san francisco", "san francisco ca",
           "san francisco,", "san francisco, ca", "san francisco, ca",
           "san francisco, ca 94132", "san franciscvo", "san francsico",
           "san franicisco", "sand francisco", "sf", "sf", "sf", "sf , ca",
           "sf ca", "94133", "sf, ca", "sf, ca 94110", "sf, ca 94115", "sf.",
           "sf`", "so. san francisco", "so.s.f.")
You can target the words "san francisco" with agrep, and the default of max.dist = 0.1 works well enough here. You can then target the s.f. variants using grep:
    # Fuzzy-match misspellings of "san francisco"
    d[agrep("san francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "san francisco"
    # Match the abbreviated forms: "s f", "s.f.", "sf", "sf.", ...
    d[grep("\\bs[. ]?f\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "san francisco"
    d
    # [1] "brentwood"     "ca"            "daly city"     "foster city"
    # [5] "hayward"       "novato"        "oakland"       "oakland"
    # [9] "san francisco" "san francisco" "san francisco" "san bruno"
    #[13] "san francisco" "san francisco" "san francisco" "san francisco"
    #[17] "san francisco" "san francisco" "san francisco" "san francisco"
    #[21] "san francisco" "san francisco" "san francisco" "san francisco"
    #[25] "san francisco" "san francisco" "san francisco" "san francisco"
    #[29] "san francisco" "san francisco" "san francisco" "san francisco"
    #[33] "san francisco" "94133"         "san francisco" "san francisco"
    #[37] "san francisco" "san francisco" "san francisco" "san francisco"
    #[41] "san francisco"
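Before overwriting anything, it can help to look at what the abbreviation pattern actually matches. The snippet below is only a sketch: `x` is a small illustrative subset of the example data, and `pattern` and `hits` are names chosen here for the demonstration.

```r
# Small illustrative subset of the example data (not the full vector).
x <- c("s f", "s.f.", "sf`", "so.s.f.", "san bruno", "oakland", "foster city")

# Same regular expression as in the grep() call: an "s", an optional
# dot or space, an "f", an optional dot, all delimited by word boundaries.
pattern <- "\\bs[. ]?f\\.?\\b"

# grepl() returns a logical vector, which makes it easy to inspect
# both the matches and the non-matches.
hits <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)

x[hits]   # values that would be recoded
x[!hits]  # values that would be left alone
```

Inspecting `x[hits]` and `x[!hits]` first is a cheap way to catch a pattern that is too greedy before it rewrites real data.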
adist is another option for targeting the words "san francisco". I found that the following settings work well. They can even pick up "san fran":
    d[adist("san francisco", d, ignore.case = TRUE,
            cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "san francisco"
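To carry this over from the example vector to the data frame column in the question, something like the following might work. It is a minimal sketch under stated assumptions: the data frame `sf` below is a tiny hypothetical stand-in for the real data, and `clean_city()` is an illustrative helper name, not a function from any package. It assigns into the existing column instead of calling subset(), since subset() would drop the non-matching rows rather than recode the matching ones.

```r
# Hypothetical stand-in for the real data frame (assumed columns: id, city).
sf <- data.frame(
  id   = 1:6,
  city = c("san francisco", "s.f.", "san franciisco",
           "oakland", "sf, ca 94110", "daly city"),
  stringsAsFactors = FALSE
)

# Illustrative helper: recode every value that looks like San Francisco.
clean_city <- function(x) {
  x <- as.character(x)  # guard against factor columns
  # Fuzzy match on the full name (covers misspellings).
  x[agrep("san francisco", x, ignore.case = TRUE,
          max.distance = 0.1)] <- "san francisco"
  # Regular expression for the abbreviated forms.
  x[grep("\\bs[. ]?f\\.?\\b", x, ignore.case = TRUE,
         perl = TRUE)] <- "san francisco"
  x
}

sf$city <- clean_city(sf$city)

table(sf$city)  # check the remaining distinct values
nrow(sf)        # row count is unchanged: 6
```

Running table(sf$city) afterwards shows which spellings, if any, slipped through, and those can then be handled with a tighter pattern or an explicit replacement.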