llmsherpa.readers package

Submodules

llmsherpa.readers.file_reader module

class llmsherpa.readers.file_reader.LayoutPDFReader(parser_api_url)

Bases: object

Reads PDF content and understands the hierarchical layout of the document's sections and structural components such as paragraphs, sentences, tables, lists, and sublists.

Parameters:

parser_api_url (str) – API url for LLM Sherpa. If you run a private instance, use its url here.

read_pdf(path_or_url, contents=None)

Reads pdf from a url or path

Parameters:
  • path_or_url (str) – path or url of the pdf file, e.g. https://someexapmple.com/myfile.pdf or /home/user/myfile.pdf

  • contents (bytes) – contents of the pdf file. If contents is given, path_or_url is ignored. This is useful when the pdf file contents are already in memory, for example in a streamlit or flask application.
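
As a minimal sketch of the two calling modes described above, the hypothetical helper `read_layout` below wraps `LayoutPDFReader`. The helper name is an assumption made for illustration. The import is deferred so the snippet loads even where the package is not installed, and nothing is sent to the parser API until the function is called.

```python
def read_layout(parser_api_url, path_or_url, contents=None):
    """Hypothetical helper: parse a PDF through the LLM Sherpa parser API.

    parser_api_url -- url of the parser API (public or private instance)
    path_or_url    -- local path or url of the PDF
    contents       -- optional PDF bytes already held in memory
    """
    # Deferred import: the llmsherpa package is only needed when this runs.
    from llmsherpa.readers.file_reader import LayoutPDFReader

    reader = LayoutPDFReader(parser_api_url)
    # Per the parameter description above, path_or_url is ignored
    # when contents is supplied.
    return reader.read_pdf(path_or_url, contents=contents)
```

For example, `read_layout(api_url, "/home/user/myfile.pdf")` reads from disk, while `read_layout(api_url, "myfile.pdf", contents=pdf_bytes)` uses bytes already in memory. Here `api_url` and `pdf_bytes` are placeholders for values you supply.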

llmsherpa.readers.layout_reader module

class llmsherpa.readers.layout_reader.Block(block_json=None)

Bases: object

A block is a node in the layout tree. It can be a paragraph, a list item, a table, or a section header. This is the base class for all blocks such as Paragraph, ListItem, Table, Section.

tag

tag of the block e.g. para, list_item, table, header

Type:

str

level

level of the block in the layout tree

Type:

int

page_idx

page index of the block in the document. It starts from 0 and is -1 if the page number is not available

Type:

int

block_idx

id of the block as returned from the server. It starts from 0 and is -1 if the id is not available

Type:

int

top

top position of the block in the page; -1 if the position is not available. Only available for tables.

Type:

float

left

left position of the block in the page; -1 if the position is not available. Only available for tables.

Type:

float

bbox

bounding box of the block in the page and it is [] if the bounding box is not available

Type:

[float]

sentences

list of sentences in the block

Type:

list

children

list of immediate child blocks, but not the children of the children

Type:

list

parent

parent of the block

Type:

Block

block_json

json returned by the parser API for the block

Type:

dict

add_child(node)

Adds a child to the block. Sets the parent of the child to self.

chunks()

Returns all the chunks in the block. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.

iter_children(node, level, node_visitor)

Iterates over all the children of the node and calls the node_visitor function on each child.
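
The visitor pattern described above can be sketched with a stand-in tree. `Node` is a hypothetical class, not the library's Block, and the free function below only mirrors the documented signature. Whether the library descends into every block type is not stated here, so this sketch simply recurses into all children.

```python
class Node:
    """Stand-in for a layout block (hypothetical, for illustration only)."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


def iter_children(node, level, node_visitor):
    """Call node_visitor on every descendant of node, depth first."""
    for child in node.children:
        node_visitor(child)
        # level tracks the depth of the walk; the visitor does not receive it.
        iter_children(child, level + 1, node_visitor)


# Build a tiny tree: header -> para -> list_item
root = Node("header", "Intro")
para = Node("para", "First paragraph")
item = Node("list_item", "A bullet")
root.children.append(para)
para.children.append(item)

# Collect the tag of each visited node.
visited = []
iter_children(root, 0, lambda n: visited.append(n.tag))
```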

paragraphs()

Returns all the paragraphs in the block. This is useful for getting all the paragraphs in a section.

parent_chain()

Returns the parent chain of the block consisting of all the parents of the block until the root.

parent_text()

Returns the text of the parent chain of the block. This is useful for adding section information to the text.
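
To illustrate how a parent chain can supply section context, here is a minimal sketch using a hypothetical `TreeNode` class. It is not the library's implementation: the root-first ordering and the " > " separator are assumptions chosen for the example.

```python
class TreeNode:
    """Stand-in for a layout block (hypothetical, for illustration only)."""

    def __init__(self, tag, text):
        self.tag = tag
        self.text = text
        self.children = []
        self.parent = None

    def add_child(self, node):
        # Attach the child and point it back at this node.
        self.children.append(node)
        node.parent = self

    def parent_chain(self):
        # Walk upward to the root, then reverse so the root comes first
        # (the ordering is an assumption of this sketch).
        chain = []
        parent = self.parent
        while parent is not None:
            chain.append(parent)
            parent = parent.parent
        chain.reverse()
        return chain

    def parent_text(self):
        # Join ancestor texts; the separator is an assumption of this sketch.
        return " > ".join(p.text for p in self.parent_chain())


guide = TreeNode("header", "Guide")
setup = TreeNode("header", "Setup")
para = TreeNode("para", "Install the package.")
guide.add_child(setup)
setup.add_child(para)

context = para.parent_text()
```

Prefixing a paragraph with this kind of context keeps the text meaningful when it is read apart from the rest of the document.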

sections()

Returns all the sections in the block. This is useful for getting all the sections in a document.

tables()

Returns all the tables in the block. This is useful for getting all the tables in a section.

to_context_text(include_section_info=True)

Returns the text of the block with section information. This provides context to the text.

to_html(include_children=False, recurse=False)

Converts the block to html. This is a virtual method and should be implemented by the derived classes.

to_text(include_children=False, recurse=False)

Converts the block to text. This is a virtual method and should be implemented by the derived classes.

class llmsherpa.readers.layout_reader.Document(blocks_json)

Bases: object

A document is a tree of blocks. It is the root node of the layout tree.

chunks()

Returns all the chunks in the document. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.
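
A common use of chunks is to collect their text together with section context. The hypothetical helper `collect_chunk_texts` below is a sketch that relies only on the `chunks()` and `to_context_text()` methods listed in this reference; it assumes that every chunk is a Block and accepts any object exposing those two methods.

```python
def collect_chunk_texts(doc, include_section_info=True):
    """Hypothetical helper: return the context text of every chunk in doc."""
    return [
        chunk.to_context_text(include_section_info=include_section_info)
        for chunk in doc.chunks()
    ]
```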

sections()

Returns all the sections in the document. This is useful for getting all the sections in a document.

tables()

Returns all the tables in the document. This is useful for getting all the tables in a document.

to_html(include_duplicates=False)

Returns html for the document by iterating through all the sections.

Parameters:

include_duplicates (bool) – If True, then html of all the sections is included. If False, then only the html of the top sections is included.

to_text(include_duplicates=False)

Returns text of a document by iterating through all the sections.

Parameters:

include_duplicates (bool) – If True, then text of all the sections is included. If False, then only the text of the top sections is included.

class llmsherpa.readers.layout_reader.LayoutReader

Bases: object

Reads the layout tree from the json returned by the parser API.

debug(pdf_root)
read(blocks_json)

Reads the layout tree from the json returned by the parser API. Constructs a tree of Block objects.

class llmsherpa.readers.layout_reader.ListItem(list_json)

Bases: Block

A list item is a block of text. It can have child list items. A list item has tag ‘list_item’.

to_html(include_children=False, recurse=False)

Converts the list item to html. If include_children is True, the html of the children is also included. If recurse is True, the html of the children’s children is also included.

Parameters:
  • include_children (bool) – If True, the html of the children is also included

  • recurse (bool) – If True, the html of the children’s children is also included

to_text(include_children=False, recurse=False)

Converts the list item to text. If include_children is True, the text of the children is also included. If recurse is True, the text of the children’s children is also included.

Parameters:
  • include_children (bool) – If True, the text of the children is also included

  • recurse (bool) – If True, the text of the children’s children is also included

class llmsherpa.readers.layout_reader.Paragraph(para_json)

Bases: Block

A paragraph is a block of text. It can have children such as lists. A paragraph has tag ‘para’.

to_html(include_children=False, recurse=False)

Converts the paragraph to html. If include_children is True, the html of the children is also included. If recurse is True, the html of the children’s children is also included.

Parameters:
  • include_children (bool) – If True, the html of the children is also included

  • recurse (bool) – If True, the html of the children’s children is also included

to_text(include_children=False, recurse=False)

Converts the paragraph to text. If include_children is True, the text of the children is also included. If recurse is True, the text of the children’s children is also included.

Parameters:
  • include_children (bool) – If True, the text of the children is also included

  • recurse (bool) – If True, the text of the children’s children is also included

class llmsherpa.readers.layout_reader.Section(section_json)

Bases: Block

A section is a block of text. It can have children such as paragraphs, lists, and tables. A section has tag ‘header’.

title

title of the section

Type:

str

to_html(include_children=False, recurse=False)

Converts the section to html. If include_children is True, the html of the children is also included. If recurse is True, the html of the children’s children is also included.

Parameters:
  • include_children (bool) – If True, the html of the children is also included

  • recurse (bool) – If True, the html of the children’s children is also included

to_text(include_children=False, recurse=False)

Converts the section to text. If include_children is True, the text of the children is also included. If recurse is True, the text of the children’s children is also included.

Parameters:
  • include_children (bool) – If True, the text of the children is also included

  • recurse (bool) – If True, the text of the children’s children is also included
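
The title attribute and the to_text flags can be combined to pull out the full text of one section. The hypothetical helper `section_text_by_title` below is a sketch that uses only `sections()`, `title`, and `to_text(include_children, recurse)` as listed in this reference; the exact-match comparison on the title is an assumption.

```python
def section_text_by_title(doc, title):
    """Hypothetical helper: text of the first section whose title matches.

    Returns None when no section has the given title.
    """
    for section in doc.sections():
        if section.title == title:
            # Include the children and, through recurse, their descendants.
            return section.to_text(include_children=True, recurse=True)
    return None
```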

class llmsherpa.readers.layout_reader.Table(table_json, parent)

Bases: Block

A table is a block of text. It can have child table rows. A table has tag ‘table’.

to_html(include_children=False, recurse=False)

Returns html for a <table> with html from all the rows in the table as <tr>

to_text(include_children=False, recurse=False)

Returns text of a table with text from all the rows in the table delimited by newlines (‘\n’).

class llmsherpa.readers.layout_reader.TableCell(cell_json)

Bases: Block

A table cell is a block of text. It can have child paragraphs. A table cell has tag ‘table_cell’. A table cell is contained within table rows.

to_html()

Returns the cell value as html. If the cell value is a paragraph node, then the html of the node is returned.

to_text()

Returns the cell value as text. If the cell value is a paragraph node, then the text of the node is returned.

class llmsherpa.readers.layout_reader.TableHeader(row_json)

Bases: Block

A table header is a block of text. It can have child table cells.

to_html(include_children=False, recurse=False)

Returns html for a <th> with html from all the cells in the row as <td>

to_text(include_children=False, recurse=False)

Returns text of a row with text from all the cells in the row delimited by ‘|’ and the header row is delimited by ‘---’. Text is returned in markdown format.

class llmsherpa.readers.layout_reader.TableRow(row_json)

Bases: Block

A table row is a block of text. It can have child table cells.

to_html(include_children=False, recurse=False)

Returns html for a <tr> with html from all the cells in the row as <td>

to_text(include_children=False, recurse=False)

Returns text of a row with text from all the cells in the row delimited by ‘|’
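
The pipe-delimited text format can be illustrated with plain strings. The functions below are hypothetical stand-ins, not the library's code: they assume markdown's `---` as the header separator, and the exact spacing and leading or trailing pipes in the library's output may differ.

```python
def row_to_text(cells):
    """Render one row as pipe-delimited text."""
    return "| " + " | ".join(cells) + " |"


def header_to_text(cells):
    """Render the header row followed by a markdown separator row."""
    separator = "| " + " | ".join("---" for _ in cells) + " |"
    return row_to_text(cells) + "\n" + separator


def table_to_text(header, rows):
    """Join the header and the body rows with newlines."""
    lines = [header_to_text(header)] + [row_to_text(row) for row in rows]
    return "\n".join(lines)


table_text = table_to_text(["Name", "Qty"], [["Bolt", "4"], ["Nut", "8"]])
```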

Module contents