Data quality in Real Estate
Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo
Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference

About Geophy
- Goal to map all buildings in the world
- Provide a quality score for each building
  - Based on location, building status, history, environmental metrics, etc.
- Semantic platform
  - RDF eases the data integration process
- Team of 45 with aim to double by next year

Real Estate is a very complex domain
Really!

Possible constraints on addresses?
- An address will start with, or at least include, a building number.
- When there is a building number, it will be all-numeric.
- No buildings are numbered zero
- Well, at the very least no buildings have negative numbers
- A building number will only be used once per street
- A building will only have one number
- A building name won't also be a number
- [...]
https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/

Geophy [set of] ontologies
- 13 ontologies (+ 9 external)
- 125 Classes
  - Buildings
  - Addresses
  - Companies
  - [...]
- 720 properties
  - 500 datatype
  - 160 relation properties
- Growing...

Quality is expensive
- Quality of source data
  - Free, open, closed data sources, etc.
- Data clean up process
  - Violations, deduplication, precision, etc.
- How much time and effort can one afford?

How much quality is good enough?
Fitness for use

Quality of ...
- Source data
  - Accuracy of the source
- Translation of source data
  - RDF mappings, RML, D2RQ, scripts, etc.
- Model design
  - Modelling quality
  - Data fitting on schema
- Model definition
  - Mapping of model on RDFS, OWL, ShEx|SHACL Shapes, etc.
  - Semantics, i.e. RDFS, OWL DL/RL/FULL, etc.

Evolution & quality
- Data evolves
- so do ontologies
- so do RDF mappings
- so does code
- so do SPARQL queries
- so do constraints
http://aligned-project.eu

Scaling quality ...
- Thousands of triples
- Millions of triples
- Billions of triples ?
Try to move validation in the K range (when possible)

Validate closer to the source
- Validate the model
- Validate the RDF mappings
- Validate RDF mapping excerpts
- Validate instance data

Automate, automate & automate
Can you spot the error?
rdfs:label rdf:langString
:foo rdfs:label foo @en .
:foo rdfs:label foo@en .

CI/CD is your buddy
- Integrate validation with your CI/CD
- Choose tools & technologies wisely
  - Jenkins, Travis, Gitlab, TeamCity
- Fail the build until data issues are fixed
- Data integration validation checks
  - Standalone datasets can pass CI

Thank you for your attention
Questions?
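
Appendix sketch: automating the "Can you spot the error?" check

The label example on the "Automate, automate & automate" slide is the kind of mistake that is hard to catch by eye and cheap to catch in a build. The exact triples did not survive as text, so the rule below is an assumption, not the authors' check: every rdfs:label object is expected to be a quoted literal immediately followed by a language tag. The function name find_label_violations and the sample triples are hypothetical, and the check is line-based on purpose; a real pipeline would more likely rely on an RDF parser or SHACL/ShEx shapes.

```python
import re

# Assumed house rule (not taken from the slides): an rdfs:label object must be
# a quoted literal directly followed by a language tag, e.g. "foo"@en.
LABEL_LINE = re.compile(r'^\s*(\S+)\s+rdfs:label\s+(.+?)\s*\.\s*$')
LANG_LITERAL = re.compile(r'^"(?:[^"\\]|\\.)*"@[A-Za-z]+(?:-[A-Za-z0-9]+)*$')


def find_label_violations(turtle_text):
    """Return (line number, subject, object) for each suspicious rdfs:label line.

    This is a naive scan that assumes one triple per line; it is not a
    Turtle parser and will miss multi-line or abbreviated statements.
    """
    violations = []
    for lineno, line in enumerate(turtle_text.splitlines(), start=1):
        match = LABEL_LINE.match(line)
        if match is None:
            continue  # not a single-line rdfs:label statement
        subject, obj = match.groups()
        if not LANG_LITERAL.match(obj):
            violations.append((lineno, subject, obj))
    return violations


# Hypothetical mapping excerpt with two suspicious labels and one clean one.
sample = (
    ':foo rdfs:label "foo" @en .\n'
    ':bar rdfs:label "bar"@en .\n'
    ':baz rdfs:label "baz @en" .\n'
)
violations = find_label_violations(sample)
```

Run against a small mapping excerpt rather than the full dataset, a check like this stays in the "K range" of triples, and a CI job can fail the build whenever the returned list is non-empty.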