Tile Health Monitor

Overview

tile_health_monitor.py contains a class whose methods monitor TPM health and control the health monitoring.

Note

This class is inherited by the Tile class in tile.py, but is kept separate for readability.

Python Class & Methods Index

Hardware functions for monitoring of TPM hardware health status.

This depends heavily on the pyfabil low level software and specific hardware module plugins.

class pyaavs.tile_health_monitor.TileHealthMonitor[source]

Tile Health Monitor Mixin Class, must be inherited by Tile Class

all_monitoring_categories()[source]

Returns a list of all monitoring point ‘categories’. Here categories is a superset of monitoring points and is the full list of strings accepted by set_monitoring_point_attr. For example, these monitoring points:

  • voltages.5V0

  • io.udp_interface.crc_error_count.FPGA0

would have these associated categories:

  • voltages

  • voltages.5V0

  • io

  • io.udp_interface

  • io.udp_interface.crc_error_count

  • io.udp_interface.crc_error_count.FPGA0

More info at https://confluence.skatelescope.org/x/nDhED

Returns:

list of categories

Return type:

list of strings
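The relationship between monitoring points and categories described above can be sketched as follows. This is a minimal illustration, assuming that a category is simply every ‘.’ delimited prefix of a monitoring point; the function name categories_for is hypothetical and is not part of pyaavs.

```python
def categories_for(points):
    """Return every dotted prefix of every monitoring point, in first-seen order.

    Hypothetical helper: it only illustrates the naming scheme and does not
    reproduce how all_monitoring_categories() is implemented.
    """
    seen = []
    for point in points:
        parts = point.split(".")
        # Build "a", "a.b", "a.b.c", ... for a point named "a.b.c".
        for depth in range(1, len(parts) + 1):
            prefix = ".".join(parts[:depth])
            if prefix not in seen:
                seen.append(prefix)
    return seen


points = ["voltages.5V0", "io.udp_interface.crc_error_count.FPGA0"]
for category in categories_for(points):
    print(category)
```

Run on the two monitoring points from the example, this prints the same six categories listed above.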

all_monitoring_points()[source]

Returns a list of all monitoring points by finding all leaf nodes in the lookup dict that have a corresponding method field.

The monitoring points returned are strings produced from ‘.’ delimited keys. For example:

  • voltages.5V0

  • io.udp_interface.crc_error_count.FPGA0

More info at https://confluence.skatelescope.org/x/nDhED

Returns:

list of monitoring points

Return type:

list of strings
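The leaf-node search described for all_monitoring_points() could look roughly like the sketch below. It assumes the lookup is a nested dict in which a monitoring point is any dict carrying a "method" key; the exact layout of the real lookup dict, the function name leaf_points and the example values are all assumptions made for illustration.

```python
def leaf_points(lookup, prefix=""):
    """Collect dotted paths of all nodes that carry a "method" field.

    Hypothetical sketch: the structure of the real lookup dict may differ.
    """
    points = []
    for key, value in lookup.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and "method" in value:
            # A node with a "method" field is treated as a monitoring point.
            points.append(path)
        elif isinstance(value, dict):
            # Otherwise keep descending into the nested dict.
            points.extend(leaf_points(value, path))
    return points


# Illustrative lookup; the lambdas stand in for real measurement methods.
example_lookup = {
    "voltages": {"5V0": {"method": lambda: 5.0}},
    "io": {
        "udp_interface": {
            "crc_error_count": {"FPGA0": {"method": lambda: 0}},
        },
    },
}

print(leaf_points(example_lookup))
```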

check_ad9528_pll_status()[source]

Status of TPM AD9528 PLL chip

This method returns lock status True if both PLLs in the AD9528 are locked. The lock loss counter increments for a loss of lock event on either PLL.

Returns:

current lock status and lock loss counter value

Return type:

tuple

check_adc_pll_status(adc_id=None)[source]

Status of ADC PLL.

This method returns a tuple: the first element is True if the PLL lock is up, and the second is True if no loss of PLL lock has been observed.

NOTE: AD9680 used in TPM 1.2 does not support loss of lock bit, only current lock status. Will return None.

A dictionary is returned with an entry for each ADC.

Returns:

(True, True) if lock is up and no loss of lock

Return type:

dict of tuple

check_adc_sysref_counter(adc_id=None, show_info=True)[source]

Checks that the ADC sysref counter is incrementing. The sysref counter increments for each sysref event and overflows at 255, approximately every 3.28 ms.

Returns True if the counter is incrementing for a given ADC. Returns a dictionary of bool, one for each ADC.

Will retry for up to 1 second until two readings can be taken in under 3 ms, to guarantee no overflow.

For debugging, if show_info is enabled then each counter reading will be displayed along with the elapsed time.

Parameters:
  • adc_id (integer) – Specify which ADC, 0-15, None for all ADCs

  • show_info (bool) – displays info messages

Returns:

True if sysref counter incrementing

Return type:

dict of bool
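The timing argument above can be made concrete. A counter that wraps after 256 counts every 3.28 ms advances roughly once every 12.8 µs, so two readings taken less than 3 ms apart cannot land on the same value unless the counter has stalled. The sketch below illustrates that retry logic for a single ADC. It is not the pyaavs implementation: the function name, the read_counter callable, the injectable clock and the None result on timeout are all assumptions.

```python
import itertools
import time


def sysref_counter_incrementing(read_counter, max_gap_s=0.003, timeout_s=1.0,
                                clock=time.perf_counter):
    """Return True/False once two readings are taken within max_gap_s.

    read_counter is a hypothetical callable returning the 8-bit counter value.
    Returns None if no sufficiently fast pair of readings is obtained before
    timeout_s elapses (an assumption; the real method may behave differently).
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        start = clock()
        first = read_counter()
        second = read_counter()
        if clock() - start < max_gap_s:
            # Less than one wrap period apart: equal values mean no increments.
            return second != first
    return None


# Stand-ins for hardware reads, so the sketch runs without a TPM.
fake_counter = itertools.count()
running = sysref_counter_incrementing(lambda: next(fake_counter) % 256)
stalled = sysref_counter_incrementing(lambda: 42)
print(running, stalled)
```

Passing the clock as a parameter keeps the retry behaviour easy to exercise without waiting on real time.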

check_adc_sysref_setup_and_hold(adc_id=None, show_info=True)[source]

Status of the ADC setup and hold monitor. Returns True if no setup or hold error for a given ADC. Returns a dictionary of bool, one for each ADC.

If show_info is enabled then descriptions from the AD9695/AD9680 documentation are also displayed to explain the value of the setup and hold monitor.

Parameters:
  • adc_id (integer) – Specify which ADC, 0-15, None for all ADCs

  • show_info (bool) – displays info messages about current setup/hold

Returns:

True if timing requirements OK

Return type:

dict of bool

check_clock_manager_status(fpga_id=None, name=None)[source]

Check status of named TPM clock manager cores (MMCM Core). Reports the status of each MMCM lock and its lock loss counter.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • name (string) – Specify name of clock manager (case-insensitive)

Returns:

Status and Counter values

Return type:

dict

check_clock_status(fpga_id=None, clock_name=None)[source]

Check status of named TPM clocks. Options are ‘jesd’, ‘ddr’ and ‘udp’; input is case-insensitive. An FPGA ID can optionally be specified to check the status on only one FPGA.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • clock_name (string) – Specify name of clock or None for all clocks

Returns:

True when Status is OK, no errors

Return type:

bool

check_ddr_initialisation(fpga_id=None)[source]

Check whether DDR has initialised.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

true if all OK

Return type:

bool

check_ddr_parity_error_counter(fpga_id=None)[source]

Check status of DDR parity error counter - used only with station beamformer

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

counter values

Return type:

dict

check_ddr_reset_counter(fpga_id=None, show_result=True)[source]

Check status of DDR user reset counter, which increments on each falling edge of the DDR-generated user logic reset.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • show_result (bool) – prints error counts on logger

Returns:

counter values

Return type:

dict

check_f2f_hard_errors()[source]

Check F2F for hard errors. Asserted until the core resets.

Returns:

hard_err register value

Return type:

integer

check_f2f_pll_status(core_id=None, show_result=True)[source]

Check current F2F PLL lock status and for loss of lock events.

Parameters:
  • core_id (integer) – Specify which F2F Core, 0,1, or None for both cores

  • show_result (bool) – prints error counts on logger

Returns:

current status and counter values

Return type:

dict

check_f2f_soft_errors()[source]

Check F2F for soft errors. Asserted for a single user_clk period.

Returns:

soft_err register value

Return type:

integer

check_global_status_alarms()[source]

Wrapper for the tpm get_global_status_alarms method. Returns None if the TPM version is 1.2.

Returns:

alarm status dict

Return type:

dict

check_jesd_lane_error_counter(fpga_id=None, core_id=None)[source]

Check JESD204 lanes errors. Checks the FPGA link error counter register.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • core_id (integer) – Specify which JESD Core, 0,1, or None for both cores

Returns:

true if all OK

Return type:

bool

check_jesd_lane_status(fpga_id=None, core_id=None)[source]

Check if JESD204 lanes are synchronized. Checks the FPGA sync status register.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • core_id (integer) – Specify which JESD Core, 0,1, or None for both cores

Returns:

true if all OK

Return type:

bool

check_jesd_qpll_status(fpga_id=None, show_result=True)[source]

Check JESD204 current status and for loss of QPLL lock events. Checks the FPGA qpll lock and qpll lock loss counter registers (shared between JESD cores).

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • show_result (bool) – prints error counts on logger

Returns:

current status and counter value tuple

Return type:

dict

check_jesd_resync_counter(fpga_id=None, show_result=True)[source]

Check JESD204 for resync events. Checks the FPGA resync counter register (shared between JESD cores).

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • show_result (bool) – prints error counts on logger

Returns:

counter values

Return type:

dict

check_pps_status(fpga_id=None)[source]

Check that PPS is detected and that the PPS period is as expected. Firmware counts the number of cycles between PPS pulses and sets an error flag if the value does not match the pps_exp_tc register.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

true if all OK

Return type:

bool

check_station_beamformer_status(fpga_id=None)[source]

Check status of Station Beamformer error flags and counters.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • show_result (bool) – prints error counts on logger

Returns:

True when Status is OK, no errors

Return type:

bool

check_tile_beamformer_status(fpga_id=None)[source]

Check tile beamformer error flags.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

True when Status is OK, no errors

Return type:

bool

check_udp_arp_table_status(fpga_id=None, show_result=True)[source]

Check that the UDP ARP table has been populated correctly. This is a non-destructive version of the method check_arp_table.

Parameters:

show_result (bool) – prints ARP table contents on logger

Returns:

true if each FPGA has at least one entry valid and resolved.

Return type:

bool

check_udp_bip_error_counter(fpga_id=None)[source]

Check UDP interface for BIP errors.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

counter values

Return type:

dict

check_udp_crc_error_counter(fpga_id=None)[source]

Check UDP interface for CRC errors.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

counter values

Return type:

dict

check_udp_decode_error_counter(fpga_id=None)[source]

Check UDP interface for 64b/66b decoding errors.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

counter values

Return type:

dict

check_udp_linkup_loss_counter(fpga_id=None, show_result=True)[source]

Check UDP interface for linkup loss events.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • show_result (bool) – prints error counts on logger

Returns:

counter values

Return type:

dict

check_udp_status(fpga_id=None)[source]

Check for 40G errors.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

true if all OK (all error counters are 0)

Return type:

bool

clear_ad9528_pll_status()[source]

Resets the value in the AD9528 PLL lock loss counter to 0.

clear_clock_manager_status(fpga_id=None, name=None)[source]

Clear status of named TPM clock manager cores (MMCM Core). Used to reset MMCM lock loss counters.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • name (string) – Specify name of clock manager (case-insensitive)

clear_clock_status(fpga_id=None, clock_name=None)[source]

Clear status of named TPM clocks. Used to clear error flags in FPGA firmware. Options are ‘jesd’, ‘ddr’ and ‘udp’; input is case-insensitive. An FPGA ID can optionally be specified to clear the status on only one FPGA.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • clock_name (string) – Specify name of clock or None for all clocks

clear_ddr_reset_counter(fpga_id=None)[source]

Reset value of DDR user reset counter.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

clear_f2f_pll_lock_loss_counter(core_id=None)[source]

Reset value of F2F PLL lock loss counter.

Parameters:

core_id (integer) – Specify which F2F Core, 0,1, or None for both cores

clear_health_status(group=None)[source]

clear_jesd_error_counters(fpga_id=None)[source]

Reset JESD error counters.

  • JESD Error Counter

  • JESD Resync Counter (shared between JESD cores)

  • JESD QPLL Lock Loss Counter (shared between JESD cores)

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

clear_pps_status(fpga_id=None)[source]

Clear PPS error flags.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

clear_station_beamformer_status(fpga_id=None)[source]

Clear status of Station Beamformer error flags and counters, including the DDR parity error counter.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

clear_tile_beamformer_status(fpga_id=None)[source]

Clear tile beamformer error flags.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

True when Status is OK, no errors

Return type:

bool

clear_udp_status(fpga_id=None)[source]

Reset 40G error counters.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs
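One way the clear and check methods might be paired is to reset the counters, wait, and then see whether any errors accumulated in the interval. The sketch below assumes a tile object exposing clear_udp_status and check_udp_status with the signatures documented here; the helper udp_clean_after_clear, the settle_s parameter and the FakeTile stand-in are hypothetical and exist only so the example runs without hardware.

```python
import time


def udp_clean_after_clear(tile, settle_s=1.0, fpga_id=None):
    """Reset the 40G error counters, wait, then report whether they stayed at 0.

    Hypothetical helper built on the documented method names only.
    """
    tile.clear_udp_status(fpga_id=fpga_id)
    time.sleep(settle_s)  # give new errors a chance to show up
    return tile.check_udp_status(fpga_id=fpga_id)


class FakeTile:
    """Stand-in for a real Tile; it mimics only the two methods used above."""

    def __init__(self):
        self.errors = 3  # pretend some stale errors were counted earlier

    def clear_udp_status(self, fpga_id=None):
        self.errors = 0

    def check_udp_status(self, fpga_id=None):
        return self.errors == 0


print(udp_clean_after_clear(FakeTile(), settle_s=0))
```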

disable_clock_monitoring(fpga_id=None, clock_name=None)[source]

Disable clock monitoring of named TPM clocks. Options are ‘jesd’, ‘ddr’ and ‘udp’; input is case-insensitive. An FPGA ID can optionally be specified to disable monitoring on only one FPGA.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • clock_name (string) – Specify name of clock or None for all clocks

enable_clock_monitoring(fpga_id=None, clock_name=None)[source]

Enable clock monitoring of named TPM clocks. Options are ‘jesd’, ‘ddr’ and ‘udp’; input is case-insensitive. An FPGA ID can optionally be specified to enable monitoring on only one FPGA.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • clock_name (string) – Specify name of clock or None for all clocks

enable_health_monitoring()[source]

fpga_gen(fpga_id)[source]

get_available_clock_managers()[source]

get_available_clocks_to_monitor()[source]

Returns:

list of clock names available to be monitored

Return type:

list of string

get_available_currents(fpga_id=None)[source]

Get list of available current measurements for TPM.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

TPM current names

Return type:

list

get_available_voltages(fpga_id=None)[source]

Get list of available voltage measurements for TPM.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

TPM voltage names

Return type:

list

get_current(fpga_id=None, current_name=None)[source]

Get current measurements for TPM.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • current_name (string) – Specify name of current, None for all currents

Returns:

TPM currents

Return type:

dict

get_fpga_temperature(fpga_id=None)[source]

Get FPGA temperature.

Parameters:

fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

Returns:

FPGA temperature

Return type:

dict

get_health_status(**kwargs)[source]

Returns the current value of TPM monitoring points with the specified attributes as set in the method set_monitoring_point_attr. If no arguments given, current value of all monitoring points is returned.

For example, if configured with:

    tile.set_monitoring_point_attr('io.udp_interface', my_category='yes', my_other_category=87)

subsequent calls to:

    tile.get_health_status(my_category='yes', my_other_category=87)

would return only the health status for:

  • io.udp_interface.arp

  • io.udp_interface.status

  • io.udp_interface.crc_error_count.FPGA0

  • io.udp_interface.crc_error_count.FPGA1

  • io.udp_interface.bip_error_count.FPGA0

  • io.udp_interface.bip_error_count.FPGA1

  • io.udp_interface.decode_error_count.FPGA0

  • io.udp_interface.decode_error_count.FPGA1

  • io.udp_interface.linkup_loss_count.FPGA0

  • io.udp_interface.linkup_loss_count.FPGA1

A group attribute is provided by default, see tpm_1_X_monitoring_point_lookup. It can be used as in the examples below:

    tile.get_health_status(group='temperatures')
    tile.get_health_status(group='udp_interface')
    tile.get_health_status(group='io')

Full documentation on usage available at https://confluence.skatelescope.org/x/nDhED
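The attribute-based selection described for get_health_status can be pictured as a simple filter over per-point attributes. The sketch below is an illustration only: the select_points function, the attrs_by_point mapping and the attribute values in it are assumptions, and it does not handle attributes stored as lists.

```python
def select_points(attrs_by_point, **wanted):
    """Return the monitoring points whose attributes match every keyword given.

    With no keywords, every point is returned, mirroring the documented
    behaviour of calling get_health_status() without arguments.
    """
    return [
        point
        for point, attrs in attrs_by_point.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    ]


# Illustrative attribute table; the group values are made up for the example.
attrs_by_point = {
    "io.udp_interface.arp": {"group": "udp_interface", "my_category": "yes"},
    "io.udp_interface.status": {"group": "udp_interface", "my_category": "yes"},
    "voltages.5V0": {"group": "voltages"},
}

print(select_points(attrs_by_point, my_category="yes"))
print(select_points(attrs_by_point))
```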

get_voltage(fpga_id=None, voltage_name=None)[source]

Get voltage measurements for TPM.

Parameters:
  • fpga_id (integer) – Specify which FPGA, 0,1, or None for both FPGAs

  • voltage_name (string) – Specify name of voltage, None for all voltages

Returns:

TPM voltages

Return type:

dict

init_health_monitoring()[source]

Method to load monitoring point lookup dict into attribute.

TPM monitoring point format and lookup loaded from:

  • tpm_1_2_monitoring_point_lookup.py

  • tpm_1_6_monitoring_point_lookup.py

inject_ddr_parity_error(fpga_id=None)[source]

set_monitoring_point_attr(path, override=True, **kwargs)[source]

Specify attributes for a monitoring point or subset of monitoring points, identified by path: a string name produced from ‘.’ delimited keys of the lookup dict. All available options are returned by all_monitoring_categories().

See https://confluence.skatelescope.org/x/nDhED for example usage.

Parameters:
  • path (str) – Monitoring point path (e.g. any of: ‘io.udp_interface.crc_error_count’, ‘io.udp_interface’, ‘timing’, ‘io’)

  • override (bool) – If True, overrides the specified attribute; if False, appends to it

  • **kwargs

    keyword args (e.g. rate=’fast’ or rate=8, group=’my_group’ or group=[‘my_group1’, ‘my_group2’])
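How a path selects a subset of monitoring points, and how override might differ from append, can be sketched as below. This is a guess at the semantics for illustration only: the set_attr function, the attrs_by_point mapping and, in particular, the choice to append by collecting values into a list are assumptions, not the documented behaviour of pyaavs.

```python
def set_attr(attrs_by_point, path, override=True, **kwargs):
    """Apply attributes to every monitoring point at or below the given path."""
    for point, attrs in attrs_by_point.items():
        # A path matches the point itself or any point nested beneath it.
        if point != path and not point.startswith(path + "."):
            continue
        for key, value in kwargs.items():
            if override or key not in attrs:
                attrs[key] = value
            else:
                # Assumed append behaviour: keep old values and add the new ones.
                old = attrs[key] if isinstance(attrs[key], list) else [attrs[key]]
                new = value if isinstance(value, list) else [value]
                attrs[key] = old + new


# Illustrative starting attributes; the group values are made up.
attrs_by_point = {
    "io.udp_interface.arp": {"group": "io"},
    "voltages.5V0": {"group": "voltages"},
}

set_attr(attrs_by_point, "io.udp_interface", rate="fast")
set_attr(attrs_by_point, "io", override=False, group="my_group")
print(attrs_by_point)
```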

pyaavs.tile_health_monitor.health_monitoring_compatible(func)[source]

Decorator that checks whether the provided firmware supports TPM health monitoring, by attempting to access a register that was added for TPM health monitoring. Bitstreams generated before approximately 03/2023 will not support TPM health monitoring.
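A decorator of this kind might be structured as in the sketch below, which probes a register before running the wrapped method. Everything specific here is an assumption: the register name, the read_register accessor, the KeyError raised for a missing register and the NotImplementedError raised in response are hypothetical, and the real decorator may react differently when the register is absent.

```python
import functools

PROBE_REGISTER = "example.health_monitoring_probe"  # hypothetical register name


def health_monitoring_compatible_sketch(func):
    """Run func only if the probe register can be read (illustrative only)."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            self.read_register(PROBE_REGISTER)  # hypothetical accessor
        except KeyError as exc:
            raise NotImplementedError(
                "firmware does not appear to support health monitoring"
            ) from exc
        return func(self, *args, **kwargs)

    return wrapper


class FakeTile:
    """Stand-in for a real Tile so the sketch runs without hardware."""

    def __init__(self, registers):
        self.registers = registers

    def read_register(self, name):
        return self.registers[name]  # raises KeyError if the register is absent

    @health_monitoring_compatible_sketch
    def example_check(self):
        return True


print(FakeTile({PROBE_REGISTER: 1}).example_check())
```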