R - Cleaning and editing a column


I've been trying to figure out how to clean and edit a column in my data set.

The data set I'm using is supposed to cover the city of San Francisco. A column in the data set called "city" contains multiple different spellings of San Francisco, as well as other cities. Here is what it looks like:

table(sf$city)

                                       brentwood                      ca
                   30401                      18                     370
               daly city             foster city                 hayward
                       0                       0                       0
                  novato                 oakland                 oakland
                       0                      40                       0
                     s f                    s.f.                 s.f. ca
                       0                   31428                      12
               san bruno           san francicso          san franciisco
                       0                     221                      54
           san francisco           san francisco           san francisco
                      20                     284                       0
           san francisco           san francisco        san francisco ca
                   78050                   16603                       6
          san francisco,       san francisco, ca       san francisco, ca
                      12                       4                      72
 san francisco, ca 94132          san franciscvo           san francsico
                       0                       0                       2
          san franicisco          sand francisco                      sf
                      41                      30                      17
                      sf                      sf                 sf , ca
                     214                   81226                       1
            sf ca  94133                  sf, ca            sf, ca 94110
                       0                       9                      38
            sf, ca 94115                     sf.                     sf`
                       4                    1656                      31
       so. san francisco                 so.s.f.
                       0                       6

What I'm trying to do is change sf$city so that it only contains "san francisco". All the data in sf$city should be placed under one city, San Francisco, so that when I type table(sf$city), it shows only San Francisco.

Could I use subset? Something like:

sf$city = subset(sf, city == "s.f." & "s.f. ca" & "san francicso" & ... 

And then subset the city values I want? Or would that distort and mess up the data?

Or should I try regular expressions with agrep and grep?

Example data:

d <- c("brentwood", "ca", "daly city", "foster city", "hayward", "novato",
       "oakland", "oakland", "s f", "s.f.", "s.f. ca", "san bruno",
       "san francicso", "san franciisco", "san francisco", "san francisco",
       "san francisco", "san francisco", "san francisco", "san francisco ca",
       "san francisco,", "san francisco, ca", "san francisco, ca", "san francisco, ca 94132",
       "san franciscvo", "san francsico", "san franicisco", "sand francisco",
       "sf", "sf", "sf", "sf , ca", "sf ca", "94133", "sf, ca", "sf, ca 94110",
       "sf, ca 94115", "sf.", "sf`", "so. san francisco", "so.s.f.")

You can target the values that look like "san francisco" with agrep, and the default of max.dist = 0.1 works well enough here. Then you can target the s.f. variants using grep:

# Fuzzy-match misspellings of "san francisco"
d[agrep("san francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "san francisco"
# Match abbreviations such as "sf", "s f", "s.f." as whole words
d[grep("\\bs[. ]?f\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "san francisco"
d
# [1] "brentwood"     "ca"            "daly city"     "foster city"
# [5] "hayward"       "novato"        "oakland"       "oakland"
# [9] "san francisco" "san francisco" "san francisco" "san bruno"
#[13] "san francisco" "san francisco" "san francisco" "san francisco"
#[17] "san francisco" "san francisco" "san francisco" "san francisco"
#[21] "san francisco" "san francisco" "san francisco" "san francisco"
#[25] "san francisco" "san francisco" "san francisco" "san francisco"
#[29] "san francisco" "san francisco" "san francisco" "san francisco"
#[33] "san francisco" "94133"         "san francisco" "san francisco"
#[37] "san francisco" "san francisco" "san francisco" "san francisco"
#[41] "san francisco"
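Before overwriting a real column, it may help to preview what each pattern would catch. The following is only a sketch on a shortened version of the example vector; the names fuzzy_hits, abbrev_hits and cleaned are made up for illustration, and value = TRUE simply makes agrep and grep return the matching strings instead of their positions.

```r
# Shortened, illustrative subset of the example data
d <- c("brentwood", "daly city", "s f", "s.f.", "s.f. ca", "san bruno",
       "san francicso", "san francisco, ca 94132", "sf", "sf, ca 94110",
       "so. san francisco")

# Step 1: preview the matches without modifying anything.
# max.distance is the full name of the argument that max.dist partially matches.
fuzzy_hits  <- agrep("san francisco", d, ignore.case = TRUE,
                     max.distance = 0.1, value = TRUE)
abbrev_hits <- grep("\\bs[. ]?f\\.?\\b", d, ignore.case = TRUE,
                    perl = TRUE, value = TRUE)
print(fuzzy_hits)
print(abbrev_hits)

# Step 2: work on a copy so the original values stay available for comparison
cleaned <- d
cleaned[d %in% c(fuzzy_hits, abbrev_hits)] <- "san francisco"

# Step 3: tabulate the result to see which values were left alone
print(table(cleaned))
```

Note that "so. san francisco" is caught by the fuzzy match as well, so if South San Francisco should stay a separate city, it would need to be excluded before the replacement.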

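adist is another option for targeting the values that look like "san francisco". I found that the following settings work well. They can even pick up "san fran":

# Cheap insertions/deletions, expensive substitutions
d[adist("san francisco", d, ignore.case = TRUE,
        cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "san francisco"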
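To see why a cutoff of 3 separates the variants from other cities, you can inspect the weighted distances directly. This is a rough sketch: the candidates vector is invented for illustration, and costs is the full name of the adist argument that cost partially matches.

```r
# Illustrative strings only; not taken from the real data set
candidates <- c("san francisco", "san fran", "san francicso",
                "san bruno", "oakland")

# Weighted edit distance from "san francisco" to each candidate:
# insertions and deletions cost 0.5 each, substitutions cost 3
dists <- adist("san francisco", candidates, ignore.case = TRUE,
               costs = c(del = 0.5, ins = 0.5, sub = 3))

# adist returns a 1 x n matrix; flatten it and label it for readability
dist_by_name <- setNames(as.vector(dists), candidates)
print(dist_by_name)

# Values below the cutoff would be rewritten to "san francisco"
would_replace <- dist_by_name < 3
print(would_replace)
```

Because a substitution costs more than a deletion plus an insertion, truncations and transposed letters stay cheap while genuinely different names accumulate many edits.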
