llmsherpa.readers package¶
Submodules¶
llmsherpa.readers.file_reader module¶
- class llmsherpa.readers.file_reader.LayoutPDFReader(parser_api_url)¶
Bases: object
Reads PDF content and understands the hierarchical layout of the document sections and structural components such as paragraphs, sentences, tables, lists, and sublists.
- Parameters:
parser_api_url (str) – API URL for LLM Sherpa. If you run a private instance, use your own URL here.
- read_pdf(path_or_url, contents=None)¶
Reads pdf from a url or path
- Parameters:
path_or_url (str) – path or url to the pdf file e.g. https://someexapmple.com/myfile.pdf or /home/user/myfile.pdf
contents (bytes) – contents of the pdf file. If contents is given, path_or_url is ignored. This is useful when you already have the pdf file contents in memory such as if you are using streamlit or flask.
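A minimal usage sketch of the two call styles described above. The helper names `read_pdf_from_path` and `read_pdf_from_bytes` and the `name` argument are illustrative and not part of llmsherpa; `parser_api_url` is assumed to point at a parser endpoint you have access to. The import is deferred into the functions, so llmsherpa is only needed when one of them is called.

```python
def read_pdf_from_path(parser_api_url, path_or_url):
    """Parse a PDF identified by a local path or a URL."""
    # Deferred import: the llmsherpa package is only required at call time.
    from llmsherpa.readers.file_reader import LayoutPDFReader

    reader = LayoutPDFReader(parser_api_url)
    return reader.read_pdf(path_or_url)


def read_pdf_from_bytes(parser_api_url, data, name="upload.pdf"):
    """Parse a PDF that is already in memory, e.g. a web upload."""
    from llmsherpa.readers.file_reader import LayoutPDFReader

    reader = LayoutPDFReader(parser_api_url)
    # Per the parameter description, path_or_url is ignored when contents
    # is given; `name` is only a placeholder label (an assumption here).
    return reader.read_pdf(name, contents=data)
```

In a streamlit or flask handler, `data` would typically be the bytes of the uploaded file.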
llmsherpa.readers.layout_reader module¶
- class llmsherpa.readers.layout_reader.Block(block_json=None)¶
Bases: object
A block is a node in the layout tree. It can be a paragraph, a list item, a table, or a section header. This is the base class for all blocks such as Paragraph, ListItem, Table, Section.
- tag¶
tag of the block e.g. para, list_item, table, header
- Type:
str
- level¶
level of the block in the layout tree
- Type:
int
- page_idx¶
page index of the block in the document. It starts from 0 and is -1 if the page number is not available
- Type:
int
- block_idx¶
id of the block as returned from the server. It starts from 0 and is -1 if the id is not available
- Type:
int
- top¶
top position of the block on the page; -1 if the position is not available (only available for tables)
- Type:
float
- left¶
left position of the block on the page; -1 if the position is not available (only available for tables)
- Type:
float
- bbox¶
bounding box of the block on the page; [] if the bounding box is not available
- Type:
[float]
- sentences¶
list of sentences in the block
- Type:
list
- children¶
list of immediate child blocks, but not the children of the children
- Type:
list
- block_json¶
json returned by the parser API for the block
- Type:
dict
- add_child(node)¶
Adds a child to the block. Sets the parent of the child to self.
- chunks()¶
Returns all the chunks in the block. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.
- iter_children(node, level, node_visitor)¶
Iterates over all the children of the node and calls the node_visitor function on each child.
- paragraphs()¶
Returns all the paragraphs in the block. This is useful for getting all the paragraphs in a section.
- parent_chain()¶
Returns the parent chain of the block consisting of all the parents of the block until the root.
- parent_text()¶
Returns the text of the parent chain of the block. This is useful for adding section information to the text.
- sections()¶
Returns all the sections in the block. This is useful for getting all the sections in a document.
- tables()¶
Returns all the tables in the block. This is useful for getting all the tables in a section.
- tag: str¶
- to_context_text(include_section_info=True)¶
Returns the text of the block with section information. This provides context to the text.
- to_html(include_children=False, recurse=False)¶
Converts the block to html. This is a virtual method and should be implemented by the derived classes.
- to_text(include_children=False, recurse=False)¶
Converts the block to text. This is a virtual method and should be implemented by the derived classes.
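As an illustration of the tree structure described by the `children` and `tag` attributes, here is a small depth-first traversal sketch. The helpers `walk` and `tag_outline` are hypothetical names written for this example; they rely only on the documented attributes and do not reproduce the behavior of `iter_children`.

```python
def walk(block, visit, depth=0):
    """Depth-first, pre-order traversal over a block and its descendants."""
    visit(block, depth)
    # `children` holds only the immediate children, so recurse to go deeper.
    for child in block.children:
        walk(child, visit, depth + 1)


def tag_outline(root):
    """Return one indented line per block showing its tag."""
    lines = []
    walk(root, lambda block, depth: lines.append("  " * depth + str(block.tag)))
    return lines
```

Printing `"\n".join(tag_outline(some_block))` gives a quick view of how a parsed document is nested.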
- class llmsherpa.readers.layout_reader.Document(blocks_json)¶
Bases: object
A document is a tree of blocks. It is the root node of the layout tree.
- chunks()¶
Returns all the chunks in the document. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.
- sections()¶
Returns all the sections in the document. This is useful for getting all the sections in a document.
- tables()¶
Returns all the tables in the document. This is useful for getting all the tables in a document.
- to_html(include_duplicates=False)¶
Returns html for the document by iterating through all the sections.
- Parameters:
include_duplicates (bool) – If True, then html of all the sections is included. If False, then only the html of the top sections is included.
- to_text(include_duplicates=False)¶
Returns text of the document by iterating through all the sections, separated by ‘\n’.
- Parameters:
include_duplicates (bool) – If True, then text of all the sections is included. If False, then only the text of the top sections is included.
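One common use of `chunks()` is to prepare passages for retrieval or prompting. The sketch below pairs it with `to_context_text()` so each passage carries its section information. It assumes that the objects returned by `chunks()` are Block instances, which the reference above implies but does not state; `context_chunks` is a hypothetical helper name.

```python
def context_chunks(doc, include_section_info=True):
    """Return one context-enriched string per chunk of the document."""
    return [
        chunk.to_context_text(include_section_info=include_section_info)
        for chunk in doc.chunks()
    ]
```

Passing `include_section_info=False` yields the bare chunk text, which can be useful when the section path is stored separately as metadata.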
- class llmsherpa.readers.layout_reader.LayoutReader¶
Bases: object
Reads the layout tree from the json returned by the parser API.
- debug(pdf_root)¶
- read(blocks_json)¶
Reads the layout tree from the json returned by the parser API. Constructs a tree of Block objects.
- class llmsherpa.readers.layout_reader.ListItem(list_json)¶
Bases: Block
A list item is a block of text. It can have child list items. A list item has tag ‘list_item’.
- to_html(include_children=False, recurse=False)¶
Converts the list item to html. If include_children is True, then the html of the children is also included. If recurse is True, then the html of the children’s children is also included.
- Parameters:
include_children (bool) – If True, then the html of the children is also included
recurse (bool) – If True, then the html of the children’s children is also included
- to_text(include_children=False, recurse=False)¶
Converts the list item to text. If include_children is True, then the text of the children is also included. If recurse is True, then the text of the children’s children is also included.
- Parameters:
include_children (bool) – If True, then the text of the children is also included
recurse (bool) – If True, then the text of the children’s children is also included
- class llmsherpa.readers.layout_reader.Paragraph(para_json)¶
Bases: Block
A paragraph is a block of text. It can have children such as lists. A paragraph has tag ‘para’.
- to_html(include_children=False, recurse=False)¶
Converts the paragraph to html. If include_children is True, then the html of the children is also included. If recurse is True, then the html of the children’s children is also included.
- Parameters:
include_children (bool) – If True, then the html of the children is also included
recurse (bool) – If True, then the html of the children’s children is also included
- to_text(include_children=False, recurse=False)¶
Converts the paragraph to text. If include_children is True, then the text of the children is also included. If recurse is True, then the text of the children’s children is also included.
- Parameters:
include_children (bool) – If True, then the text of the children is also included
recurse (bool) – If True, then the text of the children’s children is also included
- class llmsherpa.readers.layout_reader.Section(section_json)¶
Bases: Block
A section is a block of text. It can have children such as paragraphs, lists, and tables. A section has tag ‘header’.
- title¶
title of the section
- Type:
str
- to_html(include_children=False, recurse=False)¶
Converts the section to html. If include_children is True, then the html of the children is also included. If recurse is True, then the html of the children’s children is also included.
- Parameters:
include_children (bool) – If True, then the html of the children is also included
recurse (bool) – If True, then the html of the children’s children is also included
- to_text(include_children=False, recurse=False)¶
Converts the section to text. If include_children is True, then the text of the children is also included. If recurse is True, then the text of the children’s children is also included.
- Parameters:
include_children (bool) – If True, then the text of the children is also included
recurse (bool) – If True, then the text of the children’s children is also included
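To show how `title` and the `include_children` / `recurse` flags fit together, here is a sketch that looks up sections by title and returns their full text. `find_sections` and `section_full_text` are hypothetical helpers; the case-insensitive substring match is a choice made for this example, not library behavior.

```python
def find_sections(doc, keyword):
    """Return sections whose title contains keyword (case-insensitive)."""
    needle = keyword.lower()
    # Guard against a missing title by treating it as an empty string.
    return [s for s in doc.sections() if needle in (s.title or "").lower()]


def section_full_text(section):
    """Text of the section including all nested descendants."""
    return section.to_text(include_children=True, recurse=True)
```

With both flags left at their default of False, `to_text()` would return only the section's own text, without its paragraphs, lists, or tables.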
- class llmsherpa.readers.layout_reader.Table(table_json, parent)¶
Bases: Block
A table is a block of text. It can have child table rows. A table has tag ‘table’.
- to_html(include_children=False, recurse=False)¶
Returns html for a <table> with html from all the rows in the table as <tr>
- to_text(include_children=False, recurse=False)¶
Returns text of a table with text from all the rows in the table delimited by ‘\n’
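A short sketch of collecting every table in a document into a single HTML page, using `tables()`, `to_html()` and `page_idx`. The helper name `tables_to_html_page` and the heading markup are illustrative. Because `page_idx` starts from 0 and is -1 when unavailable, the sketch adds 1 for display and falls back to "unknown".

```python
def tables_to_html_page(doc):
    """Build a simple HTML page containing every table in the document."""
    parts = ["<html><body>"]
    for i, table in enumerate(doc.tables(), start=1):
        # page_idx is zero-based; -1 means the page is not available.
        page = "unknown" if table.page_idx < 0 else str(table.page_idx + 1)
        parts.append(f"<h3>Table {i} (page {page})</h3>")
        parts.append(table.to_html())
    parts.append("</body></html>")
    return "\n".join(parts)
```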
- class llmsherpa.readers.layout_reader.TableCell(cell_json)¶
Bases: Block
A table cell is a block of text. It can have child paragraphs. A table cell has tag ‘table_cell’. A table cell is contained within table rows.
- to_html()¶
Returns the cell value as html. If the cell value is a paragraph node, then the html of the node is returned.
- to_text()¶
Returns the cell value as text. If the cell value is a paragraph node, then the text of the node is returned.
- class llmsherpa.readers.layout_reader.TableHeader(row_json)¶
Bases: Block
A table header is a block of text. It can have child table cells.
- to_html(include_children=False, recurse=False)¶
Returns html for a <th> with html from all the cells in the row as <td>
- to_text(include_children=False, recurse=False)¶
Returns text of a row with text from all the cells in the row delimited by ‘|’. The header row is followed by a ‘---’ delimiter. Text is returned in markdown format.
- class llmsherpa.readers.layout_reader.TableRow(row_json)¶
Bases: Block
A table row is a block of text. It can have child table cells.
- to_html(include_children=False, recurse=False)¶
Returns html for a <tr> with html from all the cells in the row as <td>
- to_text(include_children=False, recurse=False)¶
Returns text of a row with text from all the cells in the row delimited by ‘|’
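Since table text is described as rows delimited by ‘\n’ with cells delimited by ‘|’, it can be split back into cell lists with plain string handling. The sketch below is an assumption-laden illustration: the exact whitespace and delimiter row emitted by the library are not specified here, so `parse_pipe_table` (a hypothetical helper) trims each cell and skips any row made up only of dashes and colons.

```python
def parse_pipe_table(text):
    """Split pipe-delimited table text into a list of rows of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop at most one leading and one trailing pipe so that
        # genuinely empty cells in between are preserved.
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line.split("|")]
        # Skip a markdown header delimiter row such as "| --- | --- |".
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows
```

A data row whose cells all consist solely of dashes would also be skipped, which is a limitation of this simple heuristic.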