Supported File Types
Raw document types
The upload endpoint supports several raw document types. Vectara extracts text
from these documents and sections them as best it can. This is a convenient way
to index text, but the caller has less control than when providing the Document
proto message directly. The following raw document types are supported:
- Commonmark / Markdown (md extension).
- PDF/A (pdf).
- Open Office (odt).
- Microsoft Word (doc, docx).
- Microsoft Powerpoint (ppt, pptx).
- Text files (txt).
- HTML files (.html).
- LXML files (.lxml).
- RTF files (.rtf).
- ePUB files (.epub).
- Email files conforming to RFC 822.
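A caller may want to check a file's extension against the list above before uploading it. The following is a minimal client-side sketch, not part of the Vectara API: the helper name `is_supported_raw_type` is hypothetical, and it assumes the file type can be judged from the extension alone, which may not match how the endpoint itself detects types. Email files are left out because the list names no extension for them.

```python
from pathlib import Path

# Extensions copied from the list of raw document types above.
# RFC 822 email files are omitted: the list gives no extension for them.
RAW_EXTENSIONS = {
    "md", "pdf", "odt", "doc", "docx", "ppt", "pptx",
    "txt", "html", "lxml", "rtf", "epub",
}


def is_supported_raw_type(filename: str) -> bool:
    """Return True if the filename's extension appears in RAW_EXTENSIONS.

    Hypothetical pre-upload check; the comparison is case-insensitive.
    """
    # Path.suffix yields the last extension including the dot, e.g. ".PDF".
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in RAW_EXTENSIONS
```

For example, `is_supported_raw_type("report.PDF")` returns `True`, while a file with no extension or an unlisted one such as `image.png` returns `False`.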
 
Semi-structured documents
In addition, the upload endpoint accepts semi-structured documents that reflect
a Document proto message. These can be sent in the following formats:
- pb: Contains a binary serialized Document proto message.
- pbtxt: Contains a Document proto message in proto text format.
- json: Contains a Document proto message in JSON text format.
For more details on how to format these types of files, read the formatting document.
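As a rough illustration of the json format, the sketch below builds a document as a Python dictionary and serializes it to JSON text. The field names (`documentId`, `title`, `section`, `text`) are assumptions about the Document proto message and are not confirmed by this page; check them against the formatting document before relying on them. The helper name `build_document_json` is hypothetical.

```python
import json


def build_document_json(document_id: str, title: str, section_texts: list) -> str:
    """Serialize a document to JSON text.

    All field names below are assumed, not taken from this page.
    """
    document = {
        "documentId": document_id,  # assumed field name
        "title": title,             # assumed field name
        # One entry per section; "section" and "text" are assumed names.
        "section": [{"text": text} for text in section_texts],
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    payload = build_document_json(
        "doc-1", "Example Document", ["First section.", "Second section."]
    )
    # Save with a .json extension so the file matches the json format above.
    with open("doc-1.json", "w", encoding="utf-8") as handle:
        handle.write(payload)
```

Building the structure in code gives the caller explicit control over sectioning, which is the main advantage of the semi-structured formats over raw document upload.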