Files Metadata

The actual contents of a file are stored in an S3 attachment. Small snippets of information about the file, including metadata (e.g. extracted by a given app), can be stored directly in its record.

See Special Concepts, Fields and Relationships in the Standard Octostar Ontology for more details about file-reserved fields.

Currently, the os_item_content field of a file should contain JSON (older, legacy files may not follow this standard). There is no standard for the contents of the JSON, but the following fields are currently written or read by some applications in the Octostar environment:

  • extract:metadata ➝ contains a dictionary with metadata extracted by some application. May include fields like Content-Type, dc:creator and many others (according to what was extracted)
  • extract:src ➝ specifies the source of extraction, e.g. doc-extract for Document Extractor app. Should be unique per app
  • extract:srcLang ➝ specifies the language of the extracted textual contents
  • extract:txt ➝ the text summary extracted from this file (e.g. audio transcription, text extraction, OCR...)
  • ner:entities ➝ a list of Named Entity Recognition entities, each in the format [name, label, context, score, comentions]
  • image:annotations ➝ a list of image annotations, each in the format {source, label, bbox:{x1,y1,x2,y2}, score, *kwargs}
  • image ➝ the file thumbnail, in the format {width, height, blurhash}
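
The fields listed above can be illustrated with a minimal Python sketch that builds a sample os_item_content value and reads it back defensively. This is not an official Octostar API: the helper name parse_item_content, all sample values, and the concrete value types (for example a two-letter code for extract:srcLang, a float for score, an empty list for comentions) are assumptions made for illustration only. Because legacy files may not contain valid JSON, the helper falls back to an empty dictionary.

```python
import json
from typing import Any


def parse_item_content(raw: Any) -> dict:
    """Return os_item_content as a dict, or {} for legacy/non-JSON content.

    Hypothetical helper, not part of any Octostar library.
    """
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, (str, bytes, bytearray)):
        return {}
    try:
        parsed = json.loads(raw)
    except (ValueError, TypeError):
        # Legacy content that is not valid JSON.
        return {}
    # Only a JSON object can carry the keys described above.
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical record: every value below is made up for illustration.
example_content = json.dumps({
    "extract:metadata": {"Content-Type": "application/pdf", "dc:creator": "Jane Doe"},
    "extract:src": "doc-extract",
    "extract:srcLang": "en",
    "extract:txt": "Quarterly report for ACME Corp.",
    # [name, label, context, score, comentions]
    "ner:entities": [["ACME Corp", "ORG", "report for ACME Corp.", 0.97, []]],
    "image:annotations": [
        {"source": "detector", "label": "logo",
         "bbox": {"x1": 10, "y1": 20, "x2": 110, "y2": 80}, "score": 0.88}
    ],
    "image": {"width": 640, "height": 480, "blurhash": "LEHV6nWB2yk8pyo0adR*.7kCMdnj"},
})

content = parse_item_content(example_content)

# Use .get() with defaults, since no field is guaranteed to be present.
entity_names = [entity[0] for entity in content.get("ner:entities", [])]
boxes = [annotation["bbox"] for annotation in content.get("image:annotations", [])]

# A legacy file whose content is plain text yields an empty dict.
legacy = parse_item_content("plain text from an older file")
```

Since the JSON has no fixed schema, reading each key with a default keeps consumers working when an application has not written a given field.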