Address Parsing Using TypeScript & Neo4j

I'm trying to find a house in the incredibly hostile Dublin housing market by gathering data to spot the diamond in the rough among all available property deals, ideally in a way few others can. Basically, I'm testing asymmetric information as a buyer in practice. I won't cover my project, Castle, in any great detail at the moment, but there is one aspect I'd like to talk about.

Part of the project involves:

Unfortunately, address parsing is hard. I initially split each address into comma-separated fragments and tried to work out house numbers, estates, counties, suburbs, and towns algorithmically, but most addresses ended up mislabelled. The code was also painful:

parse() {
  const fragments = this.long.toLowerCase().split(',').map(str => str.trim())

  // Matches either "<number> <estate>" at the start of the address, or "Apartment <number>"
  const numberEstatePattern = /(?<houseNumber0>^[0-9]+[a-zA-Z]?)\s+(?<estate0>[^,]+)|^Apartment (?<houseNumber1>[0-9]+[a-zA-Z]?)/i
  const matches = numberEstatePattern.exec(this.long)

  const groups = matches?.groups

  if (groups?.houseNumber0) {
    // parseInt ignores a trailing letter suffix, so "12a" becomes 12
    const parsed = parseInt(groups?.houseNumber0, 10)

    if (!Number.isNaN(parsed)) {
      this.houseNumber = parsed
    }
  } else if (groups?.houseNumber1) {
    const parsed = parseInt(groups?.houseNumber1, 10)

    if (!Number.isNaN(parsed)) {
      // -- much more nasty code.

I found that libpostal, an ML library trained on billions of addresses, works a lot better. There are still defects: for example, it's good at detecting towns but has a habit of mislabelling estates and can miss a town name embedded in the estate.

I purchased a dataset that includes the names of all Irish settlements and their geospatial coordinates, so I'll be able to remedy this manually in future. Despite this problem the data is usable, so I now have a map of prices by town and estate for all of Ireland! This is directly useful for estimating "true" property value by area, but I also plan to create a Voronoi diagram of Ireland and hang an A1 print in the apartment it eventually helps me buy!
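As a rough idea of how a settlement list could patch up the "town hidden in the estate" defect, here is a minimal sketch. It is not Castle's code: the `ParsedAddress` shape, the `recoverTown` helper and the "town is the last words of the estate" heuristic are all assumptions made for illustration.

```typescript
// Simplified, hypothetical shape for a parsed address; libpostal's real
// output uses its own labels, which would need mapping onto these fields.
interface ParsedAddress {
  houseNumber?: string
  estate?: string
  town?: string
}

// Hypothetical helper: if the parser found no town, look for a known
// settlement name at the end of the estate string and split it out.
function recoverTown(parsed: ParsedAddress, settlements: string[]): ParsedAddress {
  if (parsed.town || !parsed.estate) return parsed

  const estate = parsed.estate.toLowerCase().trim()

  // Try longer names first so "new ross" wins over "ross".
  const candidates = settlements
    .map(name => name.toLowerCase().trim())
    .sort((a, b) => b.length - a.length)

  for (const town of candidates) {
    // Require a preceding space so the estate keeps at least one word.
    if (estate.endsWith(' ' + town)) {
      const rest = estate.slice(0, estate.length - town.length).trim()
      return { ...parsed, estate: rest, town }
    }
  }

  return parsed
}
```

The heuristic will misfire on estates whose real name ends in a town name, so the output would still want a manual check.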

Apart from parsing the addresses, the components also need to be related to each other to be searchable. I previously used SQLite as my data store, but as the number of tables climbed it got a bit cumbersome to analyse, so I'm currently migrating to Neo4j. This helps because Neo4j treats relationships as first-class entities, so I can create a preliminary link from properties to estates and towns, and then run transformations to improve graph connectedness and remove invalid links and nodes.

MATCH (p:Property)-[:IN]-(t:Town), (p)-[:IN]-(e:Estate)
WHERE NOT (e)-[:IN]-(t)
MERGE (e)-[:IN]-(t)

Relationship rewrites like this need application-level code with SQLite, but are relatively easy in Neo4j.
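To make the intent of that query concrete, here is the same inference written as plain TypeScript over in-memory data. It's only a sketch: the `PropertyLinks` shape and the `inferEstateTowns` helper are hypothetical and stand in for what Neo4j does over the real graph.

```typescript
// Hypothetical in-memory stand-in for a Property node and its IN links.
interface PropertyLinks {
  propertyId: string
  town?: string
  estate?: string
}

// For every property linked to both a town and an estate, record an
// estate -> town link. This mirrors the MATCH ... MERGE query above.
function inferEstateTowns(properties: PropertyLinks[]): Map<string, Set<string>> {
  const estateTowns = new Map<string, Set<string>>()

  for (const { town, estate } of properties) {
    if (!town || !estate) continue

    const towns = estateTowns.get(estate) ?? new Set<string>()
    towns.add(town) // Set semantics play the role of MERGE: no duplicate links
    estateTowns.set(estate, towns)
  }

  return estateTowns
}
```

One side effect of this shape is that an estate ending up with more than one town stands out immediately, which could be a useful signal when hunting for invalid links.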

The data passes basic checks, such as the average house price in Ireland in 2020 coming out at €313,000 (!), so I think it's ready to use. I'm excited to see what inferences I can draw from this data set!
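A check like that can be as small as the sketch below. The helper names and the 5% default tolerance are assumptions made for illustration, and the reference figure would have to come from an external source.

```typescript
// Mean of a list of sale prices in euro; NaN for an empty list.
function meanPrice(prices: number[]): number {
  if (prices.length === 0) return NaN
  return prices.reduce((sum, price) => sum + price, 0) / prices.length
}

// Hypothetical check: is `actual` within `tolerance` (a fraction, e.g. 0.05
// for 5%) of a reference figure taken from an external source?
function withinTolerance(actual: number, reference: number, tolerance = 0.05): boolean {
  return Math.abs(actual - reference) <= reference * tolerance
}
```

Running `withinTolerance(meanPrice(prices2020), reference)` over one year of sales, where `prices2020` and `reference` are placeholders for that year's prices and the external figure, gives a cheap yes/no signal before any deeper analysis.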

Takeaway Points:

Addendum, October 2024

It worked in the end; I found a relatively cheap flat using Castle that I otherwise would probably have missed.