Data quality in Real Estate

  • Published on
    18-Mar-2018

Transcript

  • Data Quality in Real Estate ● Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo ● Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference
  • About Geophy ● Goal to map all buildings in the world ● Provide a quality score for each building ○ Based on location, building status, history, environmental metrics, etc ● Semantic platform ○ RDF eases the data integration process ● Team of 45 with aim to double by next year
  • Real Estate is a very complex domain. Really!
  • Possible constraints on addresses? ● An address will start with, or at least include, a building number. ● When there is a building number, it will be all-numeric. ● No buildings are numbered zero ● Well, at the very least no buildings have negative numbers ● A building number will only be used once per street ● A building will only have one number ● A building name won't also be a number ● [...] https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/
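A minimal sketch of why such constraints are risky, assuming a naive validator that encodes two of the rules above (all-numeric, never zero). The sample values are hypothetical and only stand in for the kinds of addresses the linked article describes; `naive_is_valid` is an illustrative name, not part of any Geophy tooling.

```python
import re

# Naive rule taken from the slide: a building number is all-numeric and not zero.
NAIVE_NUMBER = re.compile(r"[1-9][0-9]*")


def naive_is_valid(building_number: str) -> bool:
    """Return True only if the whole string is a positive integer without leading zeros."""
    return NAIVE_NUMBER.fullmatch(building_number) is not None


# Hypothetical sample values: a plain number, a number with a letter suffix,
# a zero, and a number range.
samples = ["12", "221B", "0", "10-12"]

# Everything the naive rule would throw away.
rejected = [s for s in samples if not naive_is_valid(s)]
print(rejected)
```

Every rejected value is something a real address dataset may contain, so a rule like this is probably better reported as a warning to review than enforced as a hard constraint.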
  • Geophy [set of] ontologies ● 13 ontologies (+ 9 external) ● 125 Classes ○ Buildings ○ Addresses ○ Companies ○ [...] ● 720 properties ○ 500 datatype ○ 160 relation properties ● Growing...
  • Quality is expensive ● Quality of source data ○ Free, open, closed data sources, etc. ● Data clean up process ○ Violations, deduplication, precision, etc. ○ How much time and effort can one afford? How much quality is good enough? → Fitness for use
  • Quality of ... ● Source data ○ Accuracy of the source ● Translation of source data ○ RDF mappings, RML, D2RQ, scripts, etc. ● Model design ○ Modelling quality ○ Data fitting on schema ● Model definition ○ Mapping of model on RDFS, OWL, ShEx|SHACL Shapes, etc. ○ Semantics, e.g. RDFS, OWL DL/RL/Full, etc.
  • Evolution & quality ● Data evolves ● so do ontologies ● so do RDF mappings ● so does code ● so do SPARQL queries ● so do constraints http://aligned-project.eu
  • Scaling quality ... ● Thousands of triples ● Millions of triples ● Billions of triples ● ? Try to move validation into the K range, i.e. thousands of triples (when possible)
  • Validate closer to the source ● Validate the model ● Validate the RDF mappings ● Validate RDF mapping excerpts ● Validate instance data
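One way to read "validate RDF mapping excerpts" is to run the mapping over only the first few records of a source and check the resulting triples before the full dataset is ever generated. The sketch below assumes that reading. The mapping (`map_record`), the `ex:` terms, and the single check are all hypothetical stand-ins, not the Geophy pipeline.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def map_record(record: Dict) -> Iterator[Triple]:
    """Hypothetical mapping: turn one source row into a few triples."""
    subject = f"ex:building/{record['id']}"
    yield (subject, "rdf:type", "ex:Building")
    yield (subject, "ex:floors", str(record["floors"]))


def floors_is_non_negative_int(triple: Triple) -> bool:
    """Hypothetical constraint: ex:floors values must be non-negative integers."""
    _, predicate, obj = triple
    if predicate != "ex:floors":
        return True
    return obj.isdigit()


def validate_excerpt(records: Iterable[Dict], limit: int = 1000) -> List[Triple]:
    """Map only the first `limit` records and return the violating triples."""
    violations = []
    for record in islice(records, limit):
        for triple in map_record(record):
            if not floors_is_non_negative_int(triple):
                violations.append(triple)
    return violations


# Hypothetical source rows; the second one carries a bad value.
rows = [{"id": 1, "floors": 5}, {"id": 2, "floors": -3}]
print(validate_excerpt(rows))
```

Because only `limit` records are mapped, the check stays in the thousands-of-triples range however large the source is, at the cost of missing errors that occur only later in the data.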
  • Automate, automate & automate Can you spot the error? rdfs:label ⇒ rdf:langString ● :foo rdfs:label "foo @en" .
  • Automate, automate & automate Can you spot the error? rdfs:label ⇒ rdf:langString ● :foo rdfs:label "foo @en" . (language tag inside the quotes, so it is a plain string) ● :foo rdfs:label "foo"@en . (correct)
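An error like this is hard to see by eye but easy to flag automatically. As a rough illustration, the following sketch scans Turtle-like text line by line for quoted literals that end in something that looks like a language tag. The regular expression and the function name `find_misplaced_language_tags` are assumptions made for this example. It is a text heuristic, not an RDF parser, and a real setup would more likely express the rule as a SHACL or ShEx constraint.

```python
import re
from typing import List, Tuple

# Heuristic: a quoted literal whose text ends with whitespace plus "@xx"
# (optionally with subtags), and which is not itself followed by a real "@" tag.
SUSPECT = re.compile(r'"[^"]*\s@[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*\s*"(?!@)')


def find_misplaced_language_tags(turtle_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs where a language tag seems to sit inside the quotes."""
    hits = []
    for line_no, line in enumerate(turtle_text.splitlines(), start=1):
        if SUSPECT.search(line):
            hits.append((line_no, line.strip()))
    return hits


example = '''
:foo rdfs:label "foo @en" .
:bar rdfs:label "bar"@en .
'''

for line_no, line in find_misplaced_language_tags(example):
    print(f"line {line_no}: {line}")
```

Only the `:foo` line is reported. The `:bar` line carries its tag outside the quotes and passes.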
  • CI/CD is your buddy ● Integrate validation with your CI/CD ○ Choose tools & technologies wisely ○ Jenkins, Travis, GitLab, TeamCity ● Fail the build until data issues are fixed ● Data integration validation checks ○ Standalone datasets can pass CI on their own
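"Fail the build until data issues are fixed" usually comes down to a script that exits with a non-zero status when any check reports a violation, which CI servers generally treat as a failed step. The sketch below shows that pattern under simple assumptions: triples are plain Python tuples, and the one check (`buildings_have_labels`) and the `ex:` terms are hypothetical.

```python
import sys
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Check = Callable[[List[Triple]], List[str]]


def buildings_have_labels(triples: List[Triple]) -> List[str]:
    """Hypothetical check: every ex:Building must have an rdfs:label."""
    buildings = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Building"}
    labelled = {s for s, p, o in triples if p == "rdfs:label"}
    return [f"{s} has no rdfs:label" for s in sorted(buildings - labelled)]


def run_gate(triples: List[Triple], checks: List[Check]) -> int:
    """Run all checks, print violations to stderr, and return a process exit code."""
    failures: List[str] = []
    for check in checks:
        failures.extend(check(triples))
    for message in failures:
        print(f"VIOLATION: {message}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical data: ex:b2 is missing its label.
data = [
    ("ex:b1", "rdf:type", "ex:Building"),
    ("ex:b1", "rdfs:label", "Building one"),
    ("ex:b2", "rdf:type", "ex:Building"),
]

exit_code = run_gate(data, [buildings_have_labels])
# In a CI job the script would end with: sys.exit(exit_code)
```

Returning the exit code instead of calling `sys.exit` inside `run_gate` keeps the gate easy to unit-test. The CI entry point is then the only place that terminates the process.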
  • Thank you for your attention Questions?