Project tasks/Roadmap:
Phase 1
[x] Rewrite intake:
[x] Parse article JSON into Redis, creating article_id, paragraph_id
[x] Split Redis paragraphs into sentences (BERT-tokenised), stored as article_id, sentence_id, sentence
[x] Processed article keys are stored in a Bloom filter (cbloom) in Redis (see the dedup sketch after this list)
[x] Detect sentence language
[x] Apply SymSpell spelling correction (see the sketch after this list)
[x] Tokenise sentence, storing model details, input sentence, and output tokenised sentence in the DB
[ ] Idea worth trying: convert tokens to ids and feed them into a BART model deployed on RedisAI to create a summary of the article
[ ] (parked) Split the tokeniser into two parts so the output is ids, written into a tensor to be fed into the RedisAI BART model for the article summary
[ ] Change the tokeniser so the output is strings (returned as strings from the tokeniser), folding stopword and punctuation removal into the same step
[x] Remove stopwords
[ ] Expand abbreviations; store the abbreviations dictionary in Redis (cache)
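The Bloom-filter dedup step could look like the minimal sketch below, assuming a Redis instance with the RedisBloom module loaded and redis-py installed; the key names (`articles:seen`, `article:{id}:sentence:{n}`) are placeholders, not the pipeline's actual schema.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def ingest_sentences(article_id: str, sentences: list[str]) -> bool:
    """Skip articles that were already processed, then store one key per sentence."""
    # BF.ADD returns 1 if the item was newly added, 0 if it may have been seen before.
    if r.execute_command("BF.ADD", "articles:seen", article_id) == 0:
        return False  # likely a duplicate; Bloom filters can false-positive, never false-negative
    for n, sentence in enumerate(sentences):
        r.set(f"article:{article_id}:sentence:{n}", sentence)
    return True
```

A Bloom filter keeps the seen-set memory-bounded at the cost of a small false-positive rate, which here only means occasionally skipping a genuinely new article.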
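The SymSpell and stopword steps might be combined as in this sketch. It assumes the symspellpy package and its bundled general-English frequency dictionary; the real pipeline would likely swap in a medical dictionary, and the stopword list here is a tiny stand-in.

```python
import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was", "with"}  # stand-in list

def correct_and_filter(sentence: str) -> list[str]:
    """Spell-correct a sentence with SymSpell, then drop stopwords and punctuation."""
    corrected = sym_spell.lookup_compound(sentence, max_edit_distance=2)[0].term
    return [t for t in corrected.split() if t.isalnum() and t not in STOPWORDS]

print(correct_and_filter("the pateint was diagnsed with pneumonia"))
# -> ['patient', 'diagnosed', 'pneumonia'] (approximate; output depends on the dictionary)
```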
Phase 2
[x] Match tokens against the OWL ontology, mapping each search token to its canonical term; store:
canonical_term, sentence_key
synonym, sentence_key
[x] Create an Aho-Corasick automaton from the above; it is also needed for matching search input (see the sketch after this list)
[x] Form pairs and create:
[x] node, rank
[x] set of article_keys mapped to node
[x] edge, rank
[ ] set of article_keys mapped to edge
[ ] Idea worth trying: use the write-behind pattern to automatically map nodes and edges into RedisGraph (see the sketch below)
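A sketch of the Aho-Corasick matching plus the pair/rank bookkeeping, using the pyahocorasick package; the Redis key names (`node:rank`, `edge:rank`, `node:{term}:articles`, `edge:{edge}:articles`) are illustrative placeholders.

```python
from itertools import combinations

import ahocorasick
import redis

r = redis.Redis(decode_responses=True)

def build_automaton(synonym_to_canonical: dict[str, str]) -> ahocorasick.Automaton:
    """Map every synonym (and canonical name) to its canonical term."""
    automaton = ahocorasick.Automaton()
    for synonym, canonical in synonym_to_canonical.items():
        automaton.add_word(synonym, canonical)
    automaton.make_automaton()
    return automaton

def index_sentence(automaton, sentence: str, article_key: str) -> None:
    """Find canonical terms in a sentence, then rank nodes and co-occurrence edges."""
    terms = {canonical for _, canonical in automaton.iter(sentence.lower())}
    for term in terms:
        r.zincrby("node:rank", 1, term)               # node, rank
        r.sadd(f"node:{term}:articles", article_key)  # article_keys mapped to node
    for a, b in combinations(sorted(terms), 2):       # every co-occurring pair
        edge = f"{a}|{b}"
        r.zincrby("edge:rank", 1, edge)               # edge, rank
        r.sadd(f"edge:{edge}:articles", article_key)  # article_keys mapped to edge
```

Sorting the pair before building the edge key makes (a, b) and (b, a) land on the same edge.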
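And a sketch of the write-behind idea: a plain loop drains edge events from a hypothetical Redis stream (`edges:out`) and MERGEs them into a RedisGraph graph named `terms`; in production this would more likely run as a RedisGears function.

```python
import redis

r = redis.Redis(decode_responses=True)

def drain_edges(graph: str = "terms", stream: str = "edges:out") -> None:
    """Upsert pending edge events from the stream into RedisGraph."""
    for _, events in r.xread({stream: "0"}, count=100):
        for event_id, fields in events:
            # The CYPHER prefix passes parameters; MERGE creates the nodes/edge
            # on first sight and bumps the rank on every later co-occurrence.
            cypher = (
                f"CYPHER a='{fields['a']}' b='{fields['b']}' "
                "MERGE (x:Term {name: $a}) "
                "MERGE (y:Term {name: $b}) "
                "MERGE (x)-[e:COOCCURS]->(y) "
                "ON CREATE SET e.rank = 1 "
                "ON MATCH SET e.rank = e.rank + 1"
            )
            r.execute_command("GRAPH.QUERY", graph, cypher)
            r.xdel(stream, event_id)
```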
Phase 3
[ ] Create an Article node with attributes {id}, title, sentence_key:sentence (sketched below)
[ ] Visualisation in D3:
[ ] search terms matched via Aho-Corasick
[ ] nodes + edges
[ ] on node click, list articles
[ ] on edge click, list articles
[ ] on mouseover, show the definition of the term
[ ] add autocomplete to search (see the sketch after this list)
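The Article node and the autocomplete item could be backed by RedisGraph and a RediSearch suggestion dictionary, as in this sketch; the graph name `terms` and the suggestion key `term:suggest` are assumptions, not fixed names.

```python
import redis

r = redis.Redis(decode_responses=True)

def create_article_node(article_id: str, title: str) -> None:
    """Create (or update) the Article node described above."""
    r.execute_command(
        "GRAPH.QUERY", "terms",
        f"CYPHER id='{article_id}' title='{title}' "
        "MERGE (a:Article {id: $id}) SET a.title = $title",
    )

def register_suggestion(term: str, rank: float) -> None:
    """Feed a canonical term into the autocomplete dictionary, weighted by its rank."""
    r.execute_command("FT.SUGADD", "term:suggest", term, rank)

def autocomplete(prefix: str) -> list[str]:
    """Return fuzzy prefix matches for the search box."""
    return r.execute_command("FT.SUGGET", "term:suggest", prefix, "FUZZY")
```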
Data model for visualisation:
A node is a medical term from UMLS (a medical dictionary). It has the properties canonical name, rank, and description (plus its edges). Internally it also holds its synonyms.
An edge is a pair of nodes (terms) that co-occur in an article. Each edge carries the set of articles (article_id) associated with it, sorted by rank. Each edge also has a rank of its own, which we can use to vary its drawn thickness (see the payload sketch below).
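One way the D3 payload could be assembled from this data model, assuming the RedisGraph schema sketched above; the rank-to-thickness mapping is an arbitrary placeholder.

```python
import json

import redis

r = redis.Redis(decode_responses=True)

def graph_payload(term: str) -> str:
    """Fetch a term's neighbours and edge ranks as D3-ready JSON."""
    reply = r.execute_command(
        "GRAPH.QUERY", "terms",
        f"CYPHER name='{term}' "
        "MATCH (a:Term {name: $name})-[e:COOCCURS]-(b:Term) "
        "RETURN b.name, e.rank",
    )
    rows = reply[1]  # GRAPH.QUERY replies [header, rows, statistics]
    links = [
        {"source": term, "target": name, "width": 1 + int(rank) // 10}
        for name, rank in rows
    ]
    nodes = [term] + [link["target"] for link in links]
    return json.dumps({"nodes": nodes, "links": links})
```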