This post outlines how to build a relevance model that targets records of interest for NLP classification. I describe some relevance characteristics based on construction data records from an open source dataset. The objective is to set up a program that mines the data and extracts target records based on predefined criteria. For example, in the data set below we are looking for the following topics, which can be found a priori using latent Dirichlet allocation (LDA):
Size (sq feet)
Residence Type (non-residential)
Development Scale (large)
To do this, you can set up a keyword RAKE (Rapid Automatic Keyword Extraction) pass and find words in your text fields that can be associated with each category. The udpipe package provides language annotation models that make it possible to filter out stop words and group co-occurring words into candidate keyword phrases.
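To make the idea concrete, here is a minimal base R sketch of a simplified RAKE scoring pass. It is not the udpipe implementation (udpipe ships its own `keywords_rake` function that works on annotated text); the stop-word list, the sample descriptions, and the `rake.keywords` helper are all hypothetical and only illustrate how stop words split text into candidate phrases and how those phrases get scored.

```r
# Simplified RAKE in base R. Everything here is illustrative:
# the stop-word list and the sample descriptions are made up.
stop.words <- c("a", "an", "and", "the", "of", "to", "for",
                "with", "in", "on", "at", "is", "new")

rake.keywords <- function(texts, stop.words) {
  phrases <- list()
  for (txt in tolower(texts)) {
    # Punctuation ends a candidate phrase, so split into fragments first.
    for (fragment in strsplit(txt, "[,.;:!?()]+")[[1]]) {
      words <- strsplit(trimws(fragment), "[[:space:]]+")[[1]]
      words <- words[nzchar(words)]
      if (length(words) == 0) next
      is.stop <- words %in% stop.words
      # Consecutive non-stop words share a run id and form one phrase.
      run.id <- cumsum(is.stop)
      for (run in split(words[!is.stop], run.id[!is.stop])) {
        phrases[[length(phrases) + 1]] <- run
      }
    }
  }
  if (length(phrases) == 0) {
    return(data.frame(keyword = character(0), rake = numeric(0)))
  }
  all.words <- unlist(phrases)
  word.freq <- table(all.words)
  # Degree: for every occurrence of a word, add the length of its phrase.
  phrase.len <- rep(lengths(phrases), lengths(phrases))
  word.degree <- tapply(phrase.len, all.words, sum)
  word.score <- word.degree / as.numeric(word.freq[names(word.degree)])
  # A phrase scores the sum of its word scores (degree / frequency).
  result <- data.frame(
    keyword = vapply(phrases, paste, character(1), collapse = " "),
    rake = vapply(phrases, function(p) sum(word.score[p]), numeric(1)),
    stringsAsFactors = FALSE
  )
  result <- unique(result)
  result[order(-result$rake), ]
}

sample.text <- c(
  "Renovation of the family room and new roof",
  "New office building with retail space",
  "Family room expansion and garage"
)
rake.report <- rake.keywords(sample.text, stop.words)
print(rake.report)
```

Longer phrases built from frequently co-occurring words rise to the top, which is what makes the report useful for picking category keywords by hand.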
At the end of this exercise you will have a set of keywords associated with each category above. In this example, we get:
sq.foot.target <- 1000
renovation.keywords <- c("bath", "bathroom", "bedroom", "garage", "crawl space", "courtyard", "driveway", "electrical", "door", "window", "expansion", "renovation", "family room", "fence", "greenhouse", "guestroom", "guest room", "deck", "roof", "partition", "balcony", "siding")
not.residential.keywords <- c("commercial", "unit building", "industrial building", "community facility", "townhouse", "grocery store", "retail space", "new storefront", "office building", "story", "condo", "hotel")
square.feet.keywords <- c("sq.ft", "square foot", "gross sq.ft", "sq ft", "sq.", "sf lot", "ft", "sqft", "sf", "gsf", "square")
Now you can set up a parsing script that looks for these features: compute a Boolean relevance flag for each record and return every record that matches the keywords in any of the categories above.
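A parsing script along these lines could look like the base R sketch below. The keyword lists are shortened copies of the ones above; the sample records, the `matches.any` and `extract.sq.feet` helpers, and the choice to treat `sq.foot.target` as a minimum size are assumptions made for illustration.

```r
# Shortened keyword lists from the post; the records are hypothetical.
sq.foot.target <- 1000
not.residential.keywords <- c("commercial", "office building",
                              "retail space", "hotel")
renovation.keywords <- c("bathroom", "garage", "roof", "deck")

records <- data.frame(
  id = 1:4,
  description = c(
    "New office building with 25000 sq ft of retail space",
    "Replace roof and rebuild deck",
    "Interior painting only",
    "Add 1,200 sqft warehouse"
  ),
  stringsAsFactors = FALSE
)

# TRUE where a description contains at least one keyword
# (case-insensitive, literal substring match).
matches.any <- function(text, keywords) {
  hits <- lapply(keywords, function(k) grepl(k, tolower(text), fixed = TRUE))
  Reduce(`|`, hits)
}

# Pull the first number that is followed by a square-feet unit; NA if none.
extract.sq.feet <- function(text) {
  pattern <- "[0-9][0-9,]*\\s*(sq\\.?\\s?ft|square (foot|feet)|gsf|sf)\\b"
  lowered <- tolower(text)
  m <- regexpr(pattern, lowered, perl = TRUE)
  found <- rep(NA_character_, length(text))
  found[m != -1] <- regmatches(lowered, m)
  # Keep only the leading digits and drop thousands separators.
  as.numeric(gsub(",", "", sub("[^0-9,].*$", "", found)))
}

records$sq.feet <- extract.sq.feet(records$description)
# Assumption: a record is "large" when it meets or exceeds the target.
records$is.large <- !is.na(records$sq.feet) & records$sq.feet >= sq.foot.target
records$is.not.residential <- matches.any(records$description,
                                          not.residential.keywords)
records$is.renovation <- matches.any(records$description, renovation.keywords)

# Relevance flag: a match in any category marks the record as relevant.
records$relevant <- records$is.large |
  records$is.not.residential |
  records$is.renovation

relevant.records <- records[records$relevant, ]
print(relevant.records[, c("id", "description", "sq.feet")])
```

Plain substring matching is deliberately loose: short keywords such as "ft" or "story" will also match inside longer words, so a stricter version would anchor keywords on word boundaries.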
For your viewing pleasure, here is a sample keyword rake report: