Project tasks/Roadmap:

Phase 1

  • [x] rewrite intake:

    • [x] Parse JSON into Redis, creates: article_id, paragraph_id

    • [x] Split Redis paragraphs into sentences (BERT tokenised), stored as article_id, sentence_id, sentence

    • [x] Processed article keys are stored in a Bloom filter in Redis

  • [x] Detect sentence language

  • [x] Apply symspell

  • [x] Tokenise sentence, storing model details in DB: input sentence, output tokenised sentence

    • [ ] Idea worth trying: add tokens to ids and feed into BART model deployed on RedisAI to create a summary of article

    • [ ] change tokeniser in two parts so output is ids and written into tensor to be fed into RedisAI BART model for summary of the article (parked)

    • [ ] change tokeniser so output is strings (return as strings from tokeniser), add stopwords and punctuation removal into the same step

  • [x] Remove stopwords

  • [ ] Expand abbreviations, store abbreviations dictionary in Redis (cache)
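
The intake steps above could be sketched roughly as follows. This is a minimal sketch, not the project's code: a plain dict stands in for Redis, a set stands in for the Bloom filter of processed article keys, and a naive regex split stands in for the BERT-based sentence splitter. The key layouts (`paragraphs:{article_id}:{paragraph_id}`, `sentence:{article_id}:{sentence_id}`), the input JSON shape, and the function names are all assumptions made for illustration.

```python
import json
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def intake(raw_json, store, processed):
    """Parse one article and write paragraph and sentence keys into `store`.

    `store` is a dict standing in for Redis; `processed` is a set standing in
    for the Bloom filter. Returns False if the article was already processed.
    """
    article = json.loads(raw_json)
    article_id = article["article_id"]  # hypothetical field name
    if article_id in processed:
        return False
    sentence_id = 0
    for paragraph_id, paragraph in enumerate(article["paragraphs"]):
        store[f"paragraphs:{article_id}:{paragraph_id}"] = paragraph
        for sentence in split_sentences(paragraph):
            store[f"sentence:{article_id}:{sentence_id}"] = sentence
            sentence_id += 1
    processed.add(article_id)
    return True
```

Unlike the set used here, a real Bloom filter can report false positives, so a small share of new articles could be skipped as "already processed".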

Phase 2

  • [x] Match tokens with OWL ready (search token to canonical term), store:

    • canonical_term, sentence_key

    • synonym, sentence_key

  • [x] Create Aho-Corasick automaton from above (needed for matching search input as well)

  • [x] Form pairs and create:

    • [x] node, rank

      • [x] set of article_keys mapped to node

    • [x] edge, rank

      • [ ] set of article_keys mapped to edge

  • [ ] Idea worth trying: use the write-behind pattern to automatically map nodes and edges into RedisGraph
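
To illustrate the matching step in this phase, here is a small pure-Python Aho-Corasick sketch. It is a stand-in for illustration only; the project may well use a library instead. The function names and the `terms` mapping (surface form or synonym to canonical term) are assumptions. It matches at character level, so it will also find terms inside longer words; word-boundary handling is left out.

```python
from collections import deque


def build_automaton(terms):
    """Build a trie with failure links from {surface form: canonical term}."""
    goto = [{}]  # goto[state][char] -> next state
    fail = [0]   # failure link per state
    out = [[]]   # (surface, canonical) pairs that end at each state
    for surface, canonical in terms.items():
        state = 0
        for ch in surface:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((surface, canonical))
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_terms(text, automaton):
    """Return (start index, surface form, canonical term) for every match."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for surface, canonical in out[state]:
            hits.append((i - len(surface) + 1, surface, canonical))
    return hits
```

The same automaton can be reused for sentences during intake and for user search input, which is why it is built once from the stored canonical terms and synonyms.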

Phase 3

  • [ ] Create an Article node with attributes {id}, title, sentence_key:sentence

  • [ ] Visualisation D3

  • [ ] match search terms against the Aho-Corasick automaton

  • [ ] nodes + edges

    • [ ] on click on a node, list articles

    • [ ] on click on an edge, list articles

    • [ ] on mouseover, show definition of term

  • [ ] add autocomplete into search
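
One possible shape for the autocomplete item, as a hedged sketch only: a prefix lookup over a sorted list of lowercase terms using binary search. The function name, the `limit` parameter, and keeping terms in an in-memory list are assumptions; the roadmap does not say how autocomplete will be implemented.

```python
from bisect import bisect_left


def autocomplete(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms from `sorted_terms` starting with `prefix`.

    `sorted_terms` must be a sorted list of lowercase strings.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches
```

For example, with the terms `aspirin`, `asthma`, `heart`, `heart attack`, `hepatitis`, the prefix `hea` yields `heart` and `heart attack`.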

Data model for visualisation:

  • A node is a medical term from UMLS (medical dictionary). It has the properties: canonical name, rank, description (and edges). It can also have synonyms (internally)

  • An edge is a pair of nodes (terms) found together in an article. Each edge has a list of associated articles (article_id), sorted by ... Each edge also has a rank, which we can use to change its thickness
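
The data model above could be sketched as two small classes plus a builder. This is an illustrative sketch under assumptions: rank is taken to be a simple occurrence count (the roadmap does not define it), pairs are formed from terms matched in the same sentence, and the names `Node`, `Edge`, `build_graph`, and `to_d3` are hypothetical. The `{nodes, links}` output shape is the one commonly used with D3 force layouts.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    canonical_name: str
    description: str = ""
    rank: int = 0  # assumption: number of sentences mentioning the term
    synonyms: set = field(default_factory=set)
    article_ids: set = field(default_factory=set)


@dataclass
class Edge:
    source: str
    target: str
    rank: int = 0  # assumption: number of sentences where both terms occur
    article_ids: set = field(default_factory=set)


def build_graph(matches):
    """`matches` is an iterable of (article_id, [canonical terms in one sentence])."""
    nodes, edges = {}, {}
    for article_id, terms in matches:
        unique = sorted(set(terms))
        for term in unique:
            node = nodes.setdefault(term, Node(term))
            node.rank += 1
            node.article_ids.add(article_id)
        # Every unordered pair of terms in the sentence becomes an edge.
        for a, b in combinations(unique, 2):
            edge = edges.setdefault((a, b), Edge(a, b))
            edge.rank += 1
            edge.article_ids.add(article_id)
    return nodes, edges


def to_d3(nodes, edges):
    """Shape the graph as a {nodes, links} dict; edge rank can drive thickness."""
    return {
        "nodes": [{"id": n.canonical_name, "rank": n.rank} for n in nodes.values()],
        "links": [
            {"source": e.source, "target": e.target, "rank": e.rank,
             "articles": sorted(e.article_ids)}
            for e in edges.values()
        ],
    }
```

Sorting the terms before pairing keeps each edge key in one canonical order, so (a, b) and (b, a) are counted as the same edge.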